TOYOTA Machine Learning on Apache Spark™
FEATURE ENGINEERING
Brian Kursar, Sr Data Scientist
Toyota Motor Sales IT Research and Development
Big Data Day LA 2015
Big Data
Big Data 1956
Big Data
Increased Storage Capacity Faster Processing
TOYOTA Big Data History
2015
2014
2013
2012
2011
2010
C360 - Next Gen Insights Platform: Over 6B Records
C360 - Customer Experience Analytics: Over 700M Records
C360 - Toyota Social Media Intelligence Center: Over 500M Records
Product Quality Analytics v2: Over 120M Records
Marketing and Incentives Analytics: 70M Records
Product Quality Analytics: Over 60M Records
Sentiment Analysis
Basic Sentiment Analysis is not enough
Existing Tools
Jan – Feb Feb - Mar
+1% +1%
+1% +2%
Toyota Social Opinion 2013
WHY?
It doesn’t give you the “Why”
40% Retailers Selling Toyota Vehicles
11% Opinions on Marketing Campaigns
10% Feedback on Dealer Sales and Service Experiences
9% Opinions on Product Styling and Features
8% People In Market for a Toyota
8% Incident Reports Involving a Toyota Vehicle
7% Feedback on Product Quality
5% Customers Advocating for the Brand
2% Completely Irrelevant
Toyota Online Conversations by the Numbers
2014
Study
50%
Noise
Millions of Social Media posts a day and not
enough Resources to read them all
Problem Statement
Categorize and Prioritize incoming Social Media
interactions in Real-Time using Machine Learning to
provide ACTIONABLE INSIGHTS
Campaign Opinions
Customer Feedback
Product Feedback
Noise
Technology Opportunities
Is this the image that executives at your company have when they hear the words “Machine Learning”? If so, help make it relatable.
First Spark MLlib Experiment
• Seat Cover Wrinkles/Cracking
• Brake Noise
• Shift Quality
• Oil Leaks
• HVAC Odor
• Dead Battery
• Rodent Wire Harness Damage
• Paint Chips
Time-box project to 12 Weeks
Classify at a minimum of 80% accuracy
How does one find training data for a noise they have never personally experienced?
Noise, Brakes
If it were as easy as using the keywords “Brake” and “Noise”, then why bother using Machine Learning?
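A minimal sketch of the keyword baseline that this slide argues against. The function name `keyword_match` is hypothetical, and the two posts are the customer comments quoted elsewhere in this deck as true positives. Neither contains the word "brake" or "noise", so a pure keyword filter misses both.

```python
def keyword_match(text, keywords=("brake", "noise")):
    """Naive baseline: flag a post only if every keyword appears in it."""
    lowered = text.lower()
    return all(keyword in lowered for keyword in keywords)


# Two brake-noise complaints quoted in the talk (true positives).
posts = [
    "When I'm backing up in my 2012 Prius, it sounds like something hanging up "
    "or scraping as it rotates and only happens in the morning.",
    "I hear a squeak coming from the back wheel of my Prius as I pull out "
    "from my driveway in the morning.",
]

matches = [keyword_match(post) for post in posts]
print(matches)  # both posts are missed by the keyword filter
```

The baseline fails here because customers describe the symptom ("scraping", "squeak") rather than naming the category, which is the gap the learned model is meant to close.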
Where can we find categorized specific Product Quality
Concerns reported by Customers to use as Training Data?
Challenge
Mechanical Turk
The Amazon Mechanical Turk (MTurk) is a crowdsourcing Internet marketplace
that enables individuals and businesses (known as Requesters) to coordinate
the use of human intelligence to perform tasks that computers are currently
unable to do.
Mechanical Turk
Define: Define Social data with Keywords “Brake” AND Toyota Product line (e.g. Prius, Camry, Avalon)
Create: Create Human Intelligence Tasks: “Is this comment indicative of a problem concerning Brake Noise?” (Y / N)
Train: Leverage results from Mechanical Turk as positive and negative training sets.
When I’m backing up in my 2012 Prius, it sounds like something hanging up or scraping as it rotates and only happens in the morning.
I hear a squeak coming from the back wheel of my Prius as I pull out from my driveway in the morning.
True positives missed by workers in labeling due to lack of
experience with the problem they are looking to identify.
Leverage Similar Internal Datasets as Training Data
Internal Data can help build Training set with relevant Features
Social ML Pipeline: Extract Text Features
• Training Data: Hand Labeled Social; Labeled Internal Survey Responses
• Natural Language Processing: Cleanse, N-Gram Extraction, Stemming, Stop Words
• Featurizer: Chi-Square Feature Selector (N-Grams: All, Random, Top Popular, Top Random)
Statistics.chiSqTest(vec)
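To make the chi-square selection step concrete, here is a small sketch in plain Python rather than Spark. It is a stand-in for illustration, not the MLlib implementation: the helper names `chi_square` and `top_features` and the four toy documents are assumptions. Each term is scored on the 2x2 table of (term present?) against (label positive?), and the highest-scoring terms are kept.

```python
def chi_square(docs, labels, term):
    """Chi-square statistic of the 2x2 table (term present?) x (label positive?)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        if present and label:
            a += 1  # term present, positive document
        elif present:
            b += 1  # term present, negative document
        elif label:
            c += 1  # term absent, positive document
        else:
            d += 1  # term absent, negative document
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def top_features(docs, labels, k):
    """Return the k terms with the highest chi-square score."""
    vocab = sorted({term for tokens in docs for term in tokens})
    ranked = sorted(vocab, key=lambda t: chi_square(docs, labels, t), reverse=True)
    return ranked[:k]


# Toy data (hypothetical): token sets with a brake-noise label.
docs = [
    {"squeak", "brake", "morning"},
    {"squeak", "wheel"},
    {"brake", "price"},
    {"dealer", "price"},
]
labels = [True, True, False, False]

selected = top_features(docs, labels, 2)
print(selected)
```

In this toy set "brake" scores zero because it appears equally often in both classes, which mirrors the point made earlier in the deck: the obvious keyword is not necessarily the most informative feature.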
Social ML Pipeline: Train Predictive Model
• Training Data: Hand Labeled Social; Labeled Survey Responses
• Natural Language Processing: Cleanse, N-Gram Extraction, Stemming, Stop Words, Synonyms, POS
• Featurizer: Chi-Square Feature Selector (N-Grams: All, Random, Top Popular, Top Random)
• Vectorizer: TF*IDF
  TF Options: Natural, Boolean, Log, Augmented
  IDF Options: Unary, Inverse, Inv Smooth, ProbDF
• Training Selection Filters: Training Set (9 Fold), Validation Set (1 Fold)
• Support Vector Machines
• Features
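The TF and IDF options named on the pipeline slide can be sketched as follows. The slides do not give formulas, so the definitions below are the common textbook (SMART-style) variants and should be read as assumptions; in particular "Inv Smooth" is taken as log(1 + N/df) and "ProbDF" as max(0, log((N - df)/df)). Function names are hypothetical.

```python
import math
from collections import Counter


def tf_weight(count, max_count, scheme="natural"):
    """Term-frequency weight for one term in one document."""
    if count == 0:
        return 0.0
    if scheme == "natural":
        return float(count)
    if scheme == "boolean":
        return 1.0
    if scheme == "log":
        return 1.0 + math.log(count)
    if scheme == "augmented":
        return 0.5 + 0.5 * count / max_count
    raise ValueError("unknown TF scheme: " + scheme)


def idf_weight(n_docs, doc_freq, scheme="inverse"):
    """Inverse-document-frequency weight for one term."""
    if scheme == "unary":
        return 1.0
    if doc_freq == 0:
        return 0.0
    if scheme == "inverse":
        return math.log(n_docs / doc_freq)
    if scheme == "inv_smooth":
        return math.log(1.0 + n_docs / doc_freq)
    if scheme == "prob":
        if doc_freq >= n_docs:
            return 0.0
        return max(0.0, math.log((n_docs - doc_freq) / doc_freq))
    raise ValueError("unknown IDF scheme: " + scheme)


def tfidf(doc, corpus, tf_scheme="natural", idf_scheme="inverse"):
    """Weight every term of `doc` (a token list) against `corpus`."""
    counts = Counter(doc)
    if not counts:
        return {}
    max_count = max(counts.values())
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        doc_freq = sum(1 for other in corpus if term in other)
        weights[term] = (tf_weight(count, max_count, tf_scheme)
                         * idf_weight(n_docs, doc_freq, idf_scheme))
    return weights


# Toy corpus (hypothetical).
corpus = [["brake", "squeak", "squeak"], ["brake", "dealer"], ["brake", "price"]]
weights = tfidf(corpus[0], corpus)
print(weights)
```

Treating the schemes as parameters makes it cheap to try each TF/IDF combination during model iteration, which appears to be how the options on the slide were meant to be used.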
Model
Ver 1: 56% Accuracy
Ver 3: 36% Accuracy
Ver 8: 35% Accuracy
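The pipeline slide lists a training set of 9 folds and a validation set of 1 fold, which reads like 10-fold cross-validation. Assuming that interpretation, the sketch below shows how such splits and an accuracy figure could be produced. The names `k_fold_splits` and `accuracy` are hypothetical, and the talk does not say how its accuracy numbers were actually computed.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (training, validation) pairs: k-1 folds for training, 1 held out."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        training = [x for j, fold in enumerate(folds) if j != held_out for x in fold]
        yield training, validation


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


splits = list(k_fold_splits(range(100)))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Averaging the per-fold accuracy gives a steadier estimate than a single split, which matters when comparing model versions that differ by only a few points.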
False Positives
I just had my friend at the toyota dealer rotate my tires and he
said … that the brake pads are getting thin really fast.
So what should I do when they get too thin in the future and start
to squeak?
False Positives
i cut the iac hose as shown in figure 20 in the manual but when i
start the car, it started gasping for air... choking...
sounds like it's about to die out.
i bought the power brake check valve (80190 part for
kragen)... but either i'm not installing it right or it's the wrong
size... i have no idea.
Solution
Explicit Semantic Analysis
“word” ---> <concept1, weight1>, <concept2, weight2>, <concept3, weight3>
REPAIR MANUAL
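The word-to-weighted-concepts mapping can be sketched as below. This is a simplified illustration of the idea behind Explicit Semantic Analysis, not the speaker's implementation: the concept texts are invented stand-ins for repair manual sections, the weights use a plain TF*IDF over those texts, and `build_esa_index` and `interpret` are hypothetical names.

```python
import math
from collections import Counter


def build_esa_index(concept_docs):
    """Map each word to {concept: weight} using TF*IDF over the concept texts."""
    n_concepts = len(concept_docs)
    counts = {name: Counter(text.lower().split()) for name, text in concept_docs.items()}
    # Number of concept texts each word occurs in.
    doc_freq = Counter(word for cnt in counts.values() for word in cnt)
    index = {}
    for concept, cnt in counts.items():
        for word, tf in cnt.items():
            weight = tf * math.log(n_concepts / doc_freq[word])
            if weight > 0:  # words found in every concept carry no signal
                index.setdefault(word, {})[concept] = weight
    return index


def interpret(text, index):
    """Sum the concept vectors of all known words in `text`."""
    vector = Counter()
    for word in text.lower().split():
        for concept, weight in index.get(word, {}).items():
            vector[concept] += weight
    return dict(vector)


# Invented concept texts standing in for repair manual sections.
concept_docs = {
    "Brakes": "caliper pads rotor drum wheel pads",
    "Noise": "squeak grind groan squeal squeak",
    "Paint": "chips scratch wheel",
}
index = build_esa_index(concept_docs)
vector = interpret("squeak from the wheel", index)
print(vector)
```

Because a post is scored against concepts rather than literal keywords, a comment that mentions a "squeak" near a "wheel" can land close to both Noise and Brakes without using either word.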
Define Concepts
Noise, Brakes
Build Concepts
Noise: squeak, grind, groan, squeal
Brakes: caliper, pads, rotor, wheel, drum
Measure Distance Similarity Between Concepts
Noise: squeak, grind, groan, squeal
Brakes: pads, caliper, rotor, wheel, drum
Distance is calculated between concepts based on the Minkowski distance formula.
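For reference, the Minkowski distance of order p between two vectors is the p-th root of the sum of |a - b|^p over their components; p = 1 gives the Manhattan (L1) distance and p = 2 the Euclidean (L2) distance. A minimal sketch follows, with hypothetical concept-weight vectors; the talk does not state which order p was used.

```python
def minkowski(u, v, p=2.0):
    """Minkowski distance of order p between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


# Hypothetical weights of a post and of a concept over the same two terms.
post_vector = [0.0, 0.0]
concept_vector = [3.0, 4.0]

manhattan = minkowski(post_vector, concept_vector, p=1)
euclidean = minkowski(post_vector, concept_vector, p=2)
print(manhattan, euclidean)
```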
Ver 9: 82% Accuracy
Kaizen = Continuous Improvement
Kaizen: Train, Test, Evaluate, Refine
Social ML Pipeline
• Training Data: Hand Labeled Social; Labeled Survey Responses
• Natural Language Processing: Cleanse, N-Gram Extraction, Stemming, Stop Words, Synonyms, POS
• Concept Manager, Concept Interpreter, Negation Evaluator
• Distance Similarity
  Euclidean Distance (L2): √(|x|^2 + |y|^2)
  Manhattan Distance (L1): |x| + |y|
• Featurizer: Chi-Square Feature Selector (N-Grams: All, Random, Top Popular, Top Random)
• Vectorizer: TF*IDF
  TF Options: Natural, Boolean, Log, Augmented
  IDF Options: Unary, Inverse, Inv Smooth, ProbDF
• Training Selection Filters: Training Set (9 Fold), Validation Set (1 Fold)
• Support Vector Machines
• Model
TODAY
• Education and Inclusion
TEAM TOYOTA TIPS
TO STEM OR NOT TO STEM
noun verb
When does stemming cause your entities to change?
TO STEM OR NOT TO STEM
noun noun
Will stemming produce false positives?
Negation – to contradict or deny something.
NEGATION
Great work done in medical document retrieval can be
leveraged. How different is a sick person from a sick car?
• absence
• did not
• didn't
• doesn't
• isn’t
• neither
• never
• no
• none
• without
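One simple way to use a cue list like this is to mark the tokens that follow a negation cue so the classifier sees "squeak" and "NOT_squeak" as different features. The sketch below assumes a fixed window of three tokens after each cue; that window, the `NOT_` prefix, and the function name are illustrative choices, not the approach described in the talk.

```python
import re

# Cue list taken from the slide above.
NEGATION_CUES = {"absence", "did not", "didn't", "doesn't", "isn't",
                 "neither", "never", "no", "none", "without"}


def mark_negation(text, scope=3):
    """Prefix up to `scope` tokens after a negation cue with NOT_."""
    normalized = text.lower().replace("\u2019", "'")  # curly to straight apostrophe
    tokens = re.findall(r"[a-z']+", normalized)
    marked, remaining, i = [], 0, 0
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2])
        if pair in NEGATION_CUES:            # two-word cue such as "did not"
            marked.extend(tokens[i:i + 2])
            remaining = scope
            i += 2
        elif tokens[i] in NEGATION_CUES:     # one-word cue such as "never"
            marked.append(tokens[i])
            remaining = scope
            i += 1
        elif remaining > 0:                  # inside the negation window
            marked.append("NOT_" + tokens[i])
            remaining -= 1
            i += 1
        else:
            marked.append(tokens[i])
            i += 1
    return marked


marked = mark_negation("The brakes did not squeak today")
print(marked)
```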
NEGATION
did not, absence, no, isn’t, none, never: all map to the concept “Negation”
I a am an and any my no the took
N-GRAMS AND STOP WORDS
Generic definition of stop words: the most common words in a language.
• For some models, unigrams (single words) can be problematic because they lack an adjacent term that may assist in disambiguation or indicate a voice.
• It helps to keep possessives (such as “my”) in the feature set rather than removing them via a stop list, especially for Twitter data. They help identify a “Voice” versus marketing noise.
I a am an and any my no the took
• Each model should have its own carefully selected stop word list
• Add unimportant entities to the stop word list
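The effect of a per-model stop list on n-gram features can be sketched as follows. The generic list is the one shown on the slide; the model-specific list, which keeps the possessive "my", and the helper names are assumptions for illustration.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Stop words shown on the slide.
GENERIC_STOP_WORDS = {"i", "a", "am", "an", "and", "any", "my", "no", "the", "took"}
# Hypothetical model-specific list: keep "my" because it signals a customer voice.
MODEL_STOP_WORDS = GENERIC_STOP_WORDS - {"my"}


def featurize(text, stop_words, max_n=2):
    """Drop stop words, then emit unigrams up to max_n-grams."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features


generic = featurize("I took my Prius", GENERIC_STOP_WORDS)
tailored = featurize("I took my Prius", MODEL_STOP_WORDS)
print(generic)
print(tailored)
```

With the generic list only "prius" survives, while the tailored list also yields the bigram "my prius", the kind of feature that separates an owner's post from marketing copy.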
N-GRAMS AND STOP LISTS
{
  "interaction": {
    "source": "web",
    "author": {
      "username": "johndoe",
      "name": "John Doe",
      "id": 10750902,
      "parentid": 10750901,
      "avatar": "http://a0.twimg.com/profile_images/1111111111/example.jpeg",
      "link": "http://twitter.com/johndoe"
    },
    "type": "reply",
    "created_at": "Fri, 17 Aug 2012 14:13:08 +0000",
    "content": "I like ice cream!",
    "id": "1e1e875ab43fa233e074337458bc1dca",
    "link": "http://twitter.com/johndoe/statuses/111111111111111111",
    "geo": {
      "latitude": 42.376104,
      "longitude": -71.237189
    }
  },
  "twitter": {
    "created_at": "Fri, 17 Aug 2012 14:13:08 +0000"
  },
  "demographic": {
    "gender": "mostly_male"
  }
}
CONTEXT IS KING
• Not all Social Media Interactions are
equal
• Some offer better Metadata than
others
• Leverage relevant Metadata as
features, feature weights, or filters
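As a rough illustration of turning interaction metadata into features or filters, the sketch below parses a trimmed version of the example record and pulls out a few fields. The feature names, the `keep` filter, and the choice of fields are assumptions, not the feature set used in the talk.

```python
import json

# Trimmed version of the example interaction record from the slide.
RAW = """
{
  "interaction": {
    "source": "web",
    "author": {"username": "johndoe", "id": 10750902, "parentid": 10750901},
    "type": "reply",
    "content": "I like ice cream!",
    "geo": {"latitude": 42.376104, "longitude": -71.237189}
  },
  "demographic": {"gender": "mostly_male"}
}
"""


def metadata_features(record):
    """Extract a few metadata fields that could act as features or weights."""
    interaction = record.get("interaction", {})
    return {
        "source": interaction.get("source"),
        "type": interaction.get("type"),
        "is_reply": interaction.get("type") == "reply",
        "has_geo": "geo" in interaction,
        "gender": record.get("demographic", {}).get("gender"),
    }


def keep(record, allowed_types=("tweet", "reply")):
    """Filter: keep only interaction types the model was trained on."""
    return record.get("interaction", {}).get("type") in allowed_types


record = json.loads(RAW)
features = metadata_features(record)
print(features)
```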
• Leverage Metadata to set the Context for your documents
• Review Sites – Ask a Pointed Question
CONTEXT IS KING
I love it!
Metadata can tell you what “it” is: Venue Name, Movie Name, Product Name, Brand Name.
CONTEXT IS KING
Interaction Type = Reply
Child Post
I love it. Although, every morning, I keep
hearing growling and squealing noises
coming from the back seat. :P
Interaction Type = Tweet
Parent Post
Your new Prius is awesome.
How do you like it?
Context
CONTEXT IS KING
Interaction Type = Tweet
Interaction Author = @MrsKursar
I love it. Although, every morning, I keep
hearing growling and squealing noises
coming from the back seat. :P
Same text different context
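One way to act on this is to attach the parent post's text to a reply before featurizing it, so the classifier can see what "it" refers to. The sketch below uses simplified, hypothetical record fields (`id`, `parent_id`, `type`, `content`) and a hypothetical `with_context` helper rather than the full interaction schema.

```python
def with_context(post, posts_by_id):
    """Prepend the parent's text to a reply; leave standalone posts unchanged."""
    parent = posts_by_id.get(post.get("parent_id"))
    if post["type"] == "reply" and parent is not None:
        return parent["content"] + " || " + post["content"]
    return post["content"]


COMMENT = ("I love it. Although, every morning, I keep hearing growling and "
           "squealing noises coming from the back seat. :P")

parent_post = {"id": 1, "type": "tweet", "parent_id": None,
               "content": "Your new Prius is awesome. How do you like it?"}
reply_post = {"id": 2, "type": "reply", "parent_id": 1, "content": COMMENT}
standalone_post = {"id": 3, "type": "tweet", "parent_id": None, "content": COMMENT}

posts_by_id = {p["id"]: p for p in (parent_post, reply_post, standalone_post)}

reply_text = with_context(reply_post, posts_by_id)
standalone_text = with_context(standalone_post, posts_by_id)
print(reply_text)
print(standalone_text)
```

The two outputs carry the same comment, but only the reply includes the vehicle it is about, so the same words can be handled differently depending on where they were posted.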
Brian Kursar – Sr Data Scientist
Toyota Motor Sales IT Research and Development
@briankursar
toyota.com/careers

 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Big Data Day LA 2015 - Feature Engineering by Brian Kursar of Toyota

  • 1. TOYOTA Machine Learning on Apache Spark™: FEATURE ENGINEERING. Brian Kursar, Sr Data Scientist, Toyota Motor Sales IT Research and Development. Big Data Day LA 2015
  • 3. Big Data: Increased Storage Capacity, Faster Processing
  • 4. TOYOTA Big Data History (2010 to 2015, most recent first): C360 - Next Gen Insights Platform (over 6B records); C360 - Customer Experience Analytics (over 700M records); C360 - Toyota Social Media Intelligence Center (over 500M records); Product Quality Analytics v2 (over 120M records); Marketing and Incentives Analytics (70M records); Product Quality Analytics (over 60M records)
  • 5. Sentiment Analysis: Basic Sentiment Analysis is not enough
  • 6. Existing Tools: Toyota Social Opinion 2013 (chart: Jan-Feb and Feb-Mar, changes of +1%, +1%, +1%, +2%). WHY? It doesn’t give you the “Why”
  • 7. Toyota Online Conversations by the Numbers (2014 Study): 40% Retailers Selling Toyota Vehicles; 11% Opinions on Marketing Campaigns; 10% Feedback on Dealer Sales and Service Experiences; 9% Opinions on Product Styling and Features; 8% People In Market for a Toyota; 8% Incident Reports Involving a Toyota Vehicle; 7% Feedback on Product Quality; 5% Customers Advocating for the Brand; 2% Completely Irrelevant
  • 8. Toyota Online Conversations by the Numbers (2014 Study): 40% Retailers Selling Toyota Vehicles; 11% Opinions on Marketing Campaigns; 10% Feedback on Dealer Sales and Service Experiences; 9% Opinions on Product Styling and Features; 8% People In Market for a Toyota; 8% Incident Reports Involving a Toyota Vehicle; 7% Feedback on Product Quality; 5% Customers Advocating for the Brand; 2% Completely Irrelevant. 50% Noise
  • 9. Problem Statement: Millions of Social Media posts a day and not enough Resources to read them all
  • 10. Technology Opportunities: Categorize and Prioritize incoming Social Media interactions in Real-Time using Machine Learning to provide ACTIONABLE INSIGHTS. Categories: Campaign Opinions, Customer Feedback, Product Feedback, Noise
  • 11. Is this the image that executives at your company have when they hear the words “Machine Learning”? If so, help make it relatable.
  • 12. First Spark MLlib Experiment. Categories: • Seat Cover Wrinkles/Cracking • Brake Noise • Shift Quality • Oil Leaks • HVAC Odor • Dead Battery • Rodent Wire Harness Damage • Paint Chips. Goals: time-box the project to 12 weeks; classify at a minimum of 80% accuracy
  • 13. How does one find training data for a noise they have never personally experienced?
  • 14. Brakes + Noise: If it were as easy as using the keywords “Brake” and “Noise”, then why bother using Machine Learning?
  • 15. Challenge: Where can we find categorized, specific Product Quality Concerns reported by Customers to use as Training Data?
  • 16. Mechanical Turk The Amazon Mechanical Turk (MTurk) is a crowdsourcing Internet marketplace that enables individuals and businesses (known as Requesters) to coordinate the use of human intelligence to perform tasks that computers are currently unable to do.
  • 17. Mechanical Turk. Define: Social data with keywords “Brake” AND Toyota product line (e.g. Prius, Camry, Avalon). Create: Human Intelligence Tasks asking “Is this comment indicative of a problem concerning Brake Noise?” (Y/N). Train: Leverage results from Mechanical Turk as positive and negative training sets.
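The "Define" step above can be sketched as a simple keyword filter that picks candidate posts to send out for labeling. This is a minimal sketch, not the talk's implementation: the `PRODUCT_LINES` tuple holds only the three models the slide names, and `is_candidate` is a hypothetical helper.

```python
import re

# Assumption: only the three models named on the slide; a real list would be longer.
PRODUCT_LINES = ("prius", "camry", "avalon")

def is_candidate(post: str) -> bool:
    """Return True when a post mentions 'brake' AND at least one product line."""
    words = set(re.findall(r"[a-z]+", post.lower()))
    mentions_brake = any(word.startswith("brake") for word in words)
    mentions_product = any(model in words for model in PRODUCT_LINES)
    return mentions_brake and mentions_product
```

Posts that pass the filter are only candidates; the Y/N answers from the labeling task decide whether each one becomes a positive or a negative training example.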
  • 18. “When I’m backing up in my 2012 Prius, it sounds like something hanging up or scraping as it rotates and only happens in the morning.” “I hear a squeak coming from the back wheel of my Prius as I pull out from my driveway in the morning.” True positives missed by workers in labeling due to lack of experience with the problem they are looking to identify.
  • 19. Leverage Similar Internal Datasets as Training Data
  • 20. Internal Data can help build Training set with relevant Features
  • 21. Social ML Pipeline. Training Data: Hand Labeled Social; Labeled Internal Survey Responses. Natural Language Processing: Cleanse, N-Gram Extraction, Stemming, Stop Words. Featurizer: N-Grams; Chi-Square Feature Selector (All, Random, Top Popular, Top Random)
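To illustrate what a chi-square feature selector does, the sketch below scores each term by how strongly its presence depends on the label, using the standard 2x2 contingency formula, and keeps the top k. It is plain Python for readability and is not the Spark MLlib selector used in the talk; `chi_square` and `top_features` are hypothetical names.

```python
def chi_square(term, docs, labels):
    """Chi-square statistic for term presence vs. a binary label."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        if present and label:
            a += 1          # term present, positive class
        elif present:
            b += 1          # term present, negative class
        elif label:
            c += 1          # term absent, positive class
        else:
            d += 1          # term absent, negative class
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0          # term (or label) never varies, so it carries no signal
    return n * (a * d - b * c) ** 2 / denominator

def top_features(docs, labels, k):
    """Return the k terms with the highest chi-square score."""
    vocabulary = sorted({term for tokens in docs for term in tokens})
    ranked = sorted(vocabulary, key=lambda t: chi_square(t, docs, labels), reverse=True)
    return ranked[:k]
```

A term such as "brake" that appears in every document scores zero, which is one reason keyword matching alone does not separate the classes.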
  • 23. Social ML Pipeline. Training Data: Hand Labeled Social; Labeled Survey Responses. Natural Language Processing: Cleanse, N-Gram Extraction, Stemming, Stop Words, Synonyms, POS. Featurizer: N-Grams; Chi-Square Feature Selector (All, Random, Top Popular, Top Random). Vectorizer: TF*IDF with TF Options (Natural, Boolean, Log, Augmented) and IDF Options (Unary, Inverse, Inv Smooth, ProbDF). Features. Training Selection Filters: Training Set (9 Fold), Validation Set (1 Fold). Support Vector Machines
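The TF and IDF option names on this slide match the commonly documented TF*IDF weighting variants. The sketch below assumes those textbook definitions (the slide itself gives no formulas), and the scheme strings and function names are hypothetical.

```python
import math
from collections import Counter

def term_frequency(tokens, scheme="natural"):
    """Weight each term in one document under the named TF scheme."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_count = max(counts.values())
    weights = {}
    for term, count in counts.items():
        if scheme == "natural":      # raw count
            weights[term] = float(count)
        elif scheme == "boolean":    # presence only
            weights[term] = 1.0
        elif scheme == "log":        # dampened count
            weights[term] = 1.0 + math.log(count)
        elif scheme == "augmented":  # scaled by the most frequent term
            weights[term] = 0.5 + 0.5 * count / max_count
        else:
            raise ValueError(f"unknown TF scheme: {scheme}")
    return weights

def inverse_document_frequency(n_docs, doc_freq, scheme="inverse"):
    """Weight a term by how rare it is across the corpus."""
    if scheme == "unary":
        return 1.0
    if scheme == "inverse":
        return math.log(n_docs / doc_freq)
    if scheme == "smooth":
        return math.log(1.0 + n_docs / doc_freq)
    if scheme == "prob":
        if doc_freq >= n_docs:
            return 0.0               # term occurs in every document
        return max(0.0, math.log((n_docs - doc_freq) / doc_freq))
    raise ValueError(f"unknown IDF scheme: {scheme}")

def tfidf(docs, tf_scheme="natural", idf_scheme="inverse"):
    """Return one {term: weight} dict per tokenized document."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    n_docs = len(docs)
    return [
        {term: tf * inverse_document_frequency(n_docs, doc_freq[term], idf_scheme)
         for term, tf in term_frequency(doc, tf_scheme).items()}
        for doc in docs
    ]
```

Treating the TF and IDF schemes as independent options makes it straightforward to try every combination during validation and keep whichever scores best for a given model.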
  • 25. Social ML Pipeline. Training Data: Hand Labeled Social; Labeled Survey Responses. Natural Language Processing: Cleanse, N-Gram Extraction, Stemming, Stop Words, Synonyms, POS. Featurizer: N-Grams; Chi-Square Feature Selector (All, Random, Top Popular, Top Random). Vectorizer: TF*IDF with TF Options (Natural, Boolean, Log, Augmented) and IDF Options (Unary, Inverse, Inv Smooth, ProbDF). Features. Training Selection Filters: Training Set (9 Fold), Validation Set (1 Fold). Support Vector Machines. Model
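The "Training Set 9 Fold / Validation Set 1 Fold" boxes read like a ten-fold split in which nine folds train the model and one validates it. A minimal sketch under that assumption, with a hypothetical `k_fold_splits` helper:

```python
def k_fold_splits(items, k=10):
    """Yield (training, validation) pairs, holding out one fold at a time."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment to k folds
    for held_out in range(k):
        training = [
            item
            for index, fold in enumerate(folds)
            if index != held_out
            for item in fold
        ]
        yield training, folds[held_out]
```

Each labeled example lands in the validation set exactly once across the k rounds, so the averaged accuracy uses all of the labeled data.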
  • 30. False Positives I just had my friend at the toyota dealer rotate my tires and he said … that the brake pads are getting thin really fast. So what should I do when they get too thin in the future and start to squeak?
  • 31. False Positives i cut the iac hose as shown in figure 20 in the manual but when i start the car, it started gasping for air... choking... sounds like it's about to die out. i bought the power brake check valve (80190 part for kragen)... but either i'm not installing it right or it's the wrong size... i have no idea.
  • 33. Explicit Semantic Analysis “word” ---> <concept1, weight1>, <concept2, weight2>, <concept3, weight3> REPAIR MANUAL
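The word-to-weighted-concepts mapping can be sketched as an inverted index built from concept documents. The sketch assumes each repair manual section is one concept and uses a TF*IDF weight; both choices, and the function names, are illustrative rather than taken from the talk.

```python
import math
from collections import Counter, defaultdict

def build_concept_index(concept_docs):
    """Map each word to {concept: weight} from a {concept: tokens} dict."""
    n_concepts = len(concept_docs)
    doc_freq = Counter(w for tokens in concept_docs.values() for w in set(tokens))
    index = defaultdict(dict)
    for concept, tokens in concept_docs.items():
        for word, count in Counter(tokens).items():
            idf = math.log(n_concepts / doc_freq[word])
            if idf > 0:                      # skip words shared by every concept
                index[word][concept] = count * idf
    return index

def interpret(tokens, index):
    """Sum the concept vectors of all words in a text."""
    vector = Counter()
    for word in tokens:
        for concept, weight in index.get(word, {}).items():
            vector[concept] += weight
    return dict(vector)
```

With this representation a post that never says "brake" can still land near the Brakes concept if it uses words such as "pads" or "rotor".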
  • 36. Measure Distance Similarity Between Concepts. Noise: squeak, grind, groan, squeal. Brakes: pads, caliper, rotor, wheel, drum. Distance is calculated between concepts based on the Minkowski distance formula.
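The Minkowski distance of order p between two vectors is the p-th root of the sum of the absolute coordinate differences raised to the power p. Setting p = 1 gives the Manhattan (L1) distance and p = 2 gives the Euclidean (L2) distance. A minimal sketch, assuming concepts are represented as equal-length numeric vectors:

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):
    """L1 distance: the sum of absolute differences."""
    return minkowski(x, y, 1)

def euclidean(x, y):
    """L2 distance: the square root of the summed squared differences."""
    return minkowski(x, y, 2)
```

For example, `euclidean((0, 0), (3, 4))` is 5.0 while `manhattan((0, 0), (3, 4))` is 7.0, so the choice of p changes which concepts count as close.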
  • 41. Social ML Pipeline. Training Data: Hand Labeled Social; Labeled Survey Responses. Natural Language Processing: Cleanse, N-Gram Extraction, Stemming, Stop Words, Synonyms, POS. Featurizer: N-Grams; Chi-Square Feature Selector (All, Random, Top Popular, Top Random). Vectorizer: TF*IDF with TF Options (Natural, Boolean, Log, Augmented) and IDF Options (Unary, Inverse, Inv Smooth, ProbDF). Training Selection Filters: Training Set (9 Fold), Validation Set (1 Fold). Support Vector Machines. Model. Additional components: Distance Similarity (Euclidean Distance (L2): √(|x|² + |y|²); Manhattan Distance (L1): |x| + |y|), Concept Manager, Concept Interpreter, Negation Evaluator
  • 42. TODAY
  • 43. TEAM TOYOTA TIPS • Education and Inclusion
  • 44. TO STEM OR NOT TO STEM (noun vs. verb): When does stemming cause your entities to change?
  • 45. TO STEM OR NOT TO STEM (noun vs. noun): Will stemming produce false positives?
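A toy example shows how stemming can merge unrelated words. The suffix stripper below is deliberately naive and is not the stemmer used in the talk; it exists only to make the collision visible.

```python
def toy_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# "tires" (a noun, a car part) and "tiring" (unrelated) collapse to the same stem.
collision = toy_stem("tires") == toy_stem("tiring")
```

Once both words map to `tir`, a feature built on that stem fires for posts about being tired as well as posts about tires, which is the kind of false positive the slide asks about.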
  • 46. NEGATION: to contradict or deny something. Great work done in medical document retrieval can be leveraged. How different is a sick person from a sick car?
  • 47. NEGATION cue words: • absence • did not • didn't • doesn't • isn't • neither • never • no • none • without
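One common way to use such a cue list is to mark the few tokens that follow a cue as negated, so that "never squeak" and "squeak" become different features. This is a minimal window-based sketch; the window size, the `NOT_` prefix, and matching "did not" through the single token "not" are all assumptions.

```python
import re

# Cue words from the slide; "did not" is covered by the single token "not".
NEGATION_CUES = {
    "absence", "didn't", "doesn't", "isn't", "neither",
    "never", "no", "none", "not", "without",
}

def tag_negation(text, window=3):
    """Prefix up to `window` tokens after each negation cue with NOT_."""
    normalized = text.lower().replace("\u2019", "'")  # straighten curly apostrophes
    tokens = re.findall(r"[a-z']+", normalized)
    tagged = []
    remaining = 0
    for token in tokens:
        if token in NEGATION_CUES:
            remaining = window       # open (or restart) the negation window
            tagged.append(token)
        elif remaining > 0:
            tagged.append("NOT_" + token)
            remaining -= 1
        else:
            tagged.append(token)
    return tagged
```

A fixed window is crude; approaches from clinical text processing also look for terms that end the negation scope, which is the kind of prior work the previous slide points to.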
  • 48. N-GRAMS AND STOP WORDS. Stop words (generic definition): the most common words in a language, e.g. I, a, am, an, and, any, my, no, the, took. • For some models, unigrams (single words) can be problematic because they lack an adjacent term that could help with disambiguation or indicate a voice. • It helps to keep possessive words such as “my” in the feature set rather than removing them via a stop list, especially for Twitter data, because they help identify a “Voice” versus marketing noise.
  • 49. N-GRAMS AND STOP LISTS • Each Model should have its own carefully selected stop words list • Utilize unimportant entities in Stop Words Lists
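The interaction between a per-model stop list and n-gram extraction can be sketched as follows. The stop list shown is a hypothetical one that keeps "my" on purpose so that first-person bigrams survive.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def featurize(text, stop_words, n=2):
    """Drop stop words, then extract n-grams from the remaining tokens."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return ngrams(tokens, n)

# Hypothetical stop list for one model: "my" is kept to preserve the owner's voice.
BRAKE_MODEL_STOP_WORDS = {"i", "a", "an", "the", "and"}
```

With this list, `featurize("I love my Prius", BRAKE_MODEL_STOP_WORDS)` keeps the bigram "my prius", whereas a generic stop list that includes "my" would remove that signal.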
  • 50. CONTEXT IS KING • Not all Social Media Interactions are equal • Some offer better Metadata than others • Leverage relevant Metadata as features, feature weights, or filters. Example interaction: "interaction": { "source": "web", "author": { "username": "johndoe", "name": "John Doe", "id": 10750902, "parentid": 10750901, "avatar": "http://a0.twimg.com/profile_images/1111111111/example.jpeg", "link": "http://twitter.com/johndoe" }, "type": "reply", "created_at": "Fri, 17 Aug 2012 14:13:08 +0000", "content": "I like ice cream!", "id": "1e1e875ab43fa233e074337458bc1dca", "link": "http://twitter.com/johndoe/statuses/111111111111111111", "geo": { "latitude": 42.376104, "longitude": -71.237189 } }, "twitter": { "created_at": "Fri, 17 Aug 2012 14:13:08 +0000" }, "demographic": { "gender": "mostly_male" }
  • 51. CONTEXT IS KING • Leverage Metadata to set Context for your documents • Review Sites: Ask a Pointed Question. “I love it!” Metadata (Venue Name, Movie Name, Product Name, Brand Name) can tell you what “it” is.
  • 52. CONTEXT IS KING. Parent Post (Interaction Type = Tweet): “Your new Prius is awesome. How do you like it?” Child Post (Interaction Type = Reply): “I love it. Although, every morning, I keep hearing growling and squealing noises coming from the back seat. :P” Context
  • 53. CONTEXT IS KING. Interaction Type = Tweet, Interaction Author = @MrsKursar: “I love it. Although, every morning, I keep hearing growling and squealing noises coming from the back seat. :P” Same text, different context
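One way to give a classifier this context is to attach the parent post to a reply before featurizing it. The sketch below uses hypothetical flat fields (`id`, `parent_id`, `type`, `content`) rather than the exact JSON layout shown earlier, and the `[REPLY]` separator is an arbitrary choice.

```python
def text_with_context(post, posts_by_id):
    """Prepend the parent post's text to a reply; return other posts unchanged."""
    if post.get("type") == "reply":
        parent = posts_by_id.get(post.get("parent_id"))
        if parent is not None:
            return parent["content"] + " [REPLY] " + post["content"]
    return post["content"]

# Hypothetical records mirroring the parent/child example on the slides.
parent_post = {"id": 1, "type": "tweet",
               "content": "Your new Prius is awesome. How do you like it?"}
reply_post = {"id": 2, "parent_id": 1, "type": "reply",
              "content": "I love it. Although, every morning, I keep hearing noises."}
```

The same reply text then produces different features depending on whether a parent post about a Prius is attached, which is the distinction the last two slides draw.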
  • 54. Brian Kursar – Sr Data Scientist Toyota Motor Sales IT Research and Development @briankursar toyota.com/careers