Running with Elephants
Predictive Analytics with Mahout & HDInsight
Introduction
Chris Price
Senior BI Consultant with Pragmatic Works
Author
Regular Speaker
Data Geek & Super Dad!
@BluewaterSQL
http://bluewatersql.wordpress.com/
cprice@pragmaticworks.com
You are the demo….
SQL Brewhaus
http://sqlbrewhaus.azurewebsites.net
Create an Account… Rate some beers…
Don’t worry, your info will only be sold to the HIGHEST bidder
Agenda
• Business Case for Recommendations
• How a Recommendation Engine Works
• Recommendation Implementation & Integration
• Evaluating Recommendations
• Challenges of Implementing Recommendations
Making the Business Case
Objective
Increase Revenue
Increase # of Orders
Increase Items per Order
Increase Average Item Price
Up-Sell
Website Navigational Inefficiency
Cross-Sell
Business Case Example
Increased
Revenue
Recommendation Engines
• Take observation data and use data mining/machine
learning algorithms to predict outcomes
• Assumptions:
• People with similar interests have common preferences
• Sufficiently large number of preferences available
Recommendation Options
• Collaborative Filtering (Mahout)
  • User-Based
  • Item-Based
• Content-Based (Mahout Clustering)
• Data Mining (SSAS)
  • Association
  • Clustering
Technology
Mahout
• A scalable machine learning library
• Fast, Efficient & Pragmatic
• Many of the algorithms can be run on Hadoop
HDInsight
• Hadoop on Windows
• HDInsight on Windows Azure (Seamlessly scale in the cloud)
• Hortonworks Data Platform/HDP (On-Premises Solution)
Generating Recommendations
1. Sources of Data
2. Clean & Prepare Data
3. Generate Recommendations
  • Build User/Item matrix
  • Calculate User Similarity
  • Form Neighborhoods
  • Generate Recommendations
Sources of Data
• Explicit
  • Ratings
  • Feedback
  • Demographics
  • Psychographics (Personality/Lifestyle/Attitude)
  • Ephemeral Need (Need for a moment)
• Implicit
  • Purchase History
  • Click/Browse History
• Product/Item
  • Taxonomy
  • Attributes
  • Descriptions
Our focus for today
Data Preparation
• Clean-Up:
  • Remove Outliers (Z-Score)
  • Remove frequent buyers (Skew)
  • Normalize Data (Unity-Based)
• Format Data into CSV input file:
  <User ID>, <Item ID>, <Rating>
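The clean-up steps above can be sketched in plain Java. This is only an illustration, not the deck's actual preparation code: the class name `RatingPrep`, the z-score cutoff passed as `maxZ`, and the use of the population standard deviation are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the names are illustrative and not taken from the deck.
class RatingPrep {
    /** Keeps only values whose z-score magnitude is at most maxZ (population std. dev.). */
    static List<Double> removeOutliers(List<Double> values, double maxZ) {
        double mean = values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double variance = values.stream()
                .mapToDouble(v -> (v - mean) * (v - mean)).average().orElse(0.0);
        double stdDev = Math.sqrt(variance);
        List<Double> kept = new ArrayList<>();
        for (double v : values) {
            // When every value is identical stdDev is 0, so nothing counts as an outlier.
            if (stdDev == 0.0 || Math.abs((v - mean) / stdDev) <= maxZ) {
                kept.add(v);
            }
        }
        return kept;
    }

    /** Unity-based (min-max) normalization: rescales values into [0, 1]. */
    static List<Double> normalize(List<Double> values) {
        double min = values.stream().mapToDouble(Double::doubleValue).min().orElse(0.0);
        double max = values.stream().mapToDouble(Double::doubleValue).max().orElse(0.0);
        List<Double> scaled = new ArrayList<>();
        for (double v : values) {
            scaled.add(max == min ? 0.0 : (v - min) / (max - min));
        }
        return scaled;
    }

    /** Formats one preference as a "<User ID>,<Item ID>,<Rating>" CSV line. */
    static String toCsvLine(long userId, long itemId, double rating) {
        return userId + "," + itemId + "," + rating;
    }
}
```

For example, `RatingPrep.toCsvLine(7, 42, 4.5)` yields the line `7,42,4.5`.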
How it Works?
• Build a User/Item Matrix
Items
Users
1 2 3 4 5 6 7 8 9 10 … n
1 1 1 1 1
2 1 1 1
3 1 1 1 1 1
4 1 1 1
… 1 1
N
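As a rough sketch of how such a matrix might be held in memory, the snippet below parses `userId,itemId,rating` lines into a sparse map of maps. The class name `UserItemMatrix` and the nested `TreeMap` layout are assumptions chosen for illustration; Mahout uses its own `DataModel` classes for this.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory structure; not Mahout's DataModel.
class UserItemMatrix {
    /** Parses "userId,itemId,rating" lines into a sparse user -> (item -> rating) map. */
    static Map<Long, Map<Long, Double>> fromCsv(List<String> lines) {
        Map<Long, Map<Long, Double>> matrix = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            long userId = Long.parseLong(parts[0].trim());
            long itemId = Long.parseLong(parts[1].trim());
            double rating = Double.parseDouble(parts[2].trim());
            // Only observed preferences are stored; empty cells stay absent.
            matrix.computeIfAbsent(userId, id -> new TreeMap<>()).put(itemId, rating);
        }
        return matrix;
    }
}
```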
Neighborhood Formation
[Figure: users U1, U2, U3, U4, U5, U6 and U7]
Neighborhood Formation
• Requires some experimentation
• Similarity Metrics
  • Pearson Correlation
  • Euclidean Distance
  • Spearman Correlation
  • Cosine
  • Tanimoto Coefficient
  • Log-Likelihood
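Two of the metrics listed above can be written out in a few lines, which may help when experimenting. This is a minimal sketch rather than Mahout's implementation: the class name `Similarity` is hypothetical, Pearson is computed only over items both users rated, and an undefined correlation is reported as 0 by assumption.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class for illustration only.
class Similarity {
    /** Pearson correlation over the items both users rated; 0 if undefined. */
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        Set<Long> common = new HashSet<>(a.keySet());
        common.retainAll(b.keySet());
        int n = common.size();
        if (n == 0) return 0.0;
        double sumA = 0, sumB = 0;
        for (long item : common) { sumA += a.get(item); sumB += b.get(item); }
        double meanA = sumA / n, meanB = sumB / n;
        double cov = 0, varA = 0, varB = 0;
        for (long item : common) {
            double da = a.get(item) - meanA, db = b.get(item) - meanB;
            cov += da * db; varA += da * da; varB += db * db;
        }
        double denom = Math.sqrt(varA * varB);
        return denom == 0.0 ? 0.0 : cov / denom;
    }

    /** Tanimoto coefficient: |intersection| / |union| of the two rated-item sets. */
    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return union == 0 ? 0.0 : (double) intersection.size() / union;
    }
}
```

Pearson uses the rating values, while Tanimoto ignores them and only looks at which items were rated, so the latter suits yes/no preference data.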
How it Works?
• Find users similar to U5
  • Use a similarity metric to select the k nearest neighbors (kNN)
  • U1 & U7 are identified as most similar to U5
Items
Users
1 2 3 4 5 6 7 8 9 10 … n
1 1 1 1 1 1
2 1 1 1
3 1 1 1 1 1
4 1 1 1
… 1 1
N
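Selecting the neighborhood amounts to ranking the other users by their similarity to the target user and keeping the top k. The sketch below assumes the similarities have already been computed; the class name `Neighborhood` and the map-based input are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration only.
class Neighborhood {
    /** Returns the ids of the k users with the highest similarity to the target user. */
    static List<Long> nearest(Map<Long, Double> similarityToTarget, int k) {
        List<Map.Entry<Long, Double>> entries = new ArrayList<>(similarityToTarget.entrySet());
        // Sort by similarity, highest first.
        entries.sort((x, y) -> Double.compare(y.getValue(), x.getValue()));
        List<Long> neighbors = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            neighbors.add(entries.get(i).getKey());
        }
        return neighbors;
    }
}
```

With made-up similarities to U5 of 0.9 for U1, 0.8 for U7, 0.4 for U3 and 0.1 for U2, asking for k = 2 returns U1 and U7.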
How it Works?
• Generate Recommendations:
• Find items that have not been reviewed (I1 and I6)
• Predict rating by taking weighted sum
Items
Users
1 2 3 4 5 6 7 8 9 10 … n
1 1 1 1 0.5 1 1
2 1 1 1
3 1 1 1 1 1
4 1 1 1
5 1 1
6 0.7 1
Pseudo-Code Implementation
for each item i for which u has no preference
    for each user v that has a preference for i
        compute similarity s between u and v
        update a running average of v's preference for i, weighted by s
return the top-ranked items i (by weighted average)
Restrict to Neighborhood: loop only over users v in u's neighborhood
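The inner loop of the pseudo-code can be sketched in Java as follows. It is an illustration under assumptions, not Mahout's code: the class name `WeightedPrediction` is hypothetical, the neighbor similarities are passed in precomputed, and dividing by the sum of absolute similarities is one common normalization choice.

```java
import java.util.Map;

// Hypothetical helper class for illustration only.
class WeightedPrediction {
    /**
     * Estimates the target user's rating for one item as the similarity-weighted
     * average of the neighbors' ratings. Returns NaN if no neighbor rated the item.
     */
    static double predict(long itemId,
                          Map<Long, Double> neighborSimilarity,
                          Map<Long, Map<Long, Double>> ratingsByUser) {
        double weightedSum = 0.0;
        double weightTotal = 0.0;
        for (Map.Entry<Long, Double> neighbor : neighborSimilarity.entrySet()) {
            Map<Long, Double> ratings = ratingsByUser.get(neighbor.getKey());
            if (ratings == null || !ratings.containsKey(itemId)) {
                continue; // this neighbor has no preference for the item
            }
            weightedSum += neighbor.getValue() * ratings.get(itemId);
            weightTotal += Math.abs(neighbor.getValue());
        }
        return weightTotal == 0.0 ? Double.NaN : weightedSum / weightTotal;
    }
}
```

Because only the users in `neighborSimilarity` are visited, the restriction to the neighborhood comes for free.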
Mahout Implementation
• Real-Time Recommendations
  • Write Java Code and host in JVM Instance
  • Limited scalability
  • Requires Training Data
  • Integration typically handled through web services
• Batch-Based Recommendations
  • Uses MapReduce jobs on Hadoop
  • Offline, Slow, yet scalable
  • Out-of-the-box recommender jobs
Mahout MapReduce Implementation
1 – Generate List of ItemIDs
2 – Create Preference Vector
3 – Count Unique Users
4 – Transpose Preference Vectors
5 – Row Similarity
  • Compute Weights
  • Compute Similarities
  • Similarity Matrix
6 – Pre-Partial Multiply, Similarity Matrix
7 – Pre-Partial Multiply, Preferences
8 – Partial Multiply (Steps 6 & 7)
9 – Filter Items
10 – Aggregate & Recommend
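The core idea behind steps 5 to 10 (an item/item matrix multiplied by each user's preference vector, then filtered) can be shown with a single-machine analogue. This is a simplified sketch, not the MapReduce job itself: the class name `ItemBasedSketch` is hypothetical, and plain co-occurrence counts stand in for whichever similarity measure the job is configured with.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory analogue; the real job distributes this work across Hadoop.
class ItemBasedSketch {
    /** Counts, for every ordered item pair, how many users have a preference for both. */
    static Map<Long, Map<Long, Integer>> cooccurrence(Map<Long, Map<Long, Double>> prefsByUser) {
        Map<Long, Map<Long, Integer>> counts = new HashMap<>();
        for (Map<Long, Double> prefs : prefsByUser.values()) {
            for (long a : prefs.keySet()) {
                for (long b : prefs.keySet()) {
                    if (a != b) {
                        counts.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    /** Multiplies the item/item matrix by one user's preference vector, skipping rated items. */
    static Map<Long, Double> score(Map<Long, Map<Long, Integer>> counts,
                                   Map<Long, Double> userPrefs) {
        Map<Long, Double> scores = new HashMap<>();
        for (Map.Entry<Long, Double> pref : userPrefs.entrySet()) {
            Map<Long, Integer> row = counts.getOrDefault(pref.getKey(), Map.of());
            for (Map.Entry<Long, Integer> cell : row.entrySet()) {
                if (userPrefs.containsKey(cell.getKey())) {
                    continue; // filter out items the user already has a preference for
                }
                scores.merge(cell.getKey(), cell.getValue() * pref.getValue(), Double::sum);
            }
        }
        return scores;
    }
}
```

Sorting the returned scores in descending order and keeping the first few entries gives the final recommendation list for that user.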
Integrating Mahout
• Real-Time
  • Requires Java coding
  • Web Service
  • Process:
    • Load training data (memory pressure)
    • Generate recommendations
• Batch
  • ETL from source
  • Generate input file (UserID, ItemID, Rating)
  • Load to HDFS
  • Process with Mahout/Hadoop
  • ETL output from HDFS/Hadoop
    • UserID [ItemID:Estimated Rating, ………]
    • 7 [1:4.5,2:4.5,3:4.5,4:4.5,5:4.5,6:4.5,7:4.5]
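For the ETL step that reads the batch output, a line in the format shown above could be parsed roughly like this. The class name `RecommendationLine` is hypothetical, and the sketch assumes the user id is separated from the bracketed list by whitespace.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for illustration only.
class RecommendationLine {
    /** Returns the user id that precedes the bracketed list. */
    static long parseUserId(String line) {
        return Long.parseLong(line.substring(0, line.indexOf('[')).trim());
    }

    /** Parses "[item:score,item:score,...]" into an ordered item -> estimated rating map. */
    static Map<Long, Double> parseItems(String line) {
        String body = line.substring(line.indexOf('[') + 1, line.lastIndexOf(']'));
        Map<Long, Double> items = new LinkedHashMap<>();
        if (body.trim().isEmpty()) {
            return items; // the user received no recommendations
        }
        for (String pair : body.split(",")) {
            String[] kv = pair.split(":");
            items.put(Long.parseLong(kv[0].trim()), Double.parseDouble(kv[1].trim()));
        }
        return items;
    }
}
```

The `LinkedHashMap` keeps the items in the order they appear on the line, so any ranking in the output is preserved when the rows are loaded into the target store.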
Handling Recommendations
Storing Recommendations:
• Hive
  • Data Warehouse system for Hadoop
  • Hive ODBC Driver
• MongoDB
  • Leading NoSQL database
  • JSON-like storage with flexible schema
  • C#/.NET MongoDB Driver
• HBase
  • Open-source, distributed, column-oriented database modeled after Google’s Bigtable
  • Use Pig/MapReduce to process output files and load HBase table
  • Java API for easy reading
• Source System (SQL Server, etc.)
Evaluating the Recommendations
• How good are your recommendations?
• How do you evaluate the recommendation engine?
• Two options; both split the data into training and test sets:
  • Average Absolute Difference
  • Root-Mean-Square
• How does it work?

                      I1    I2    I3
Estimated Review      3.5   4.0   1.5
Actual Review         4.0   2.0   2.0
Absolute Difference   0.5   2.0   0.5

Average Difference = (0.5 + 2.0 + 0.5) / 3 = 1.0
Root-Mean-Square = √((0.5² + 2.0² + 0.5²) / 3) ≈ 1.22
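Both scores are easy to compute by hand-written code, which can be useful for checking an evaluator's output on a small sample. The sketch below uses a hypothetical class name, `EvalMetrics`, and assumes the two arrays have the same length and order.

```java
// Hypothetical helper class for illustration only.
class EvalMetrics {
    /** Mean of |estimated - actual| over all test ratings. */
    static double averageAbsoluteDifference(double[] estimated, double[] actual) {
        double total = 0.0;
        for (int i = 0; i < estimated.length; i++) {
            total += Math.abs(estimated[i] - actual[i]);
        }
        return total / estimated.length;
    }

    /** Square root of the mean squared difference; penalizes large misses more heavily. */
    static double rootMeanSquare(double[] estimated, double[] actual) {
        double total = 0.0;
        for (int i = 0; i < estimated.length; i++) {
            double diff = estimated[i] - actual[i];
            total += diff * diff;
        }
        return Math.sqrt(total / estimated.length);
    }
}
```

Calling these with the estimated reviews `{3.5, 4.0, 1.5}` and the actual reviews `{4.0, 2.0, 2.0}` reproduces the worked example. In both cases a lower score means better estimates.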
Evaluating the Recommendations
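import java.io.File;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

DataModel model = new FileDataModel(new File("ratings.csv"));
RecommenderEvaluator eval = new AverageAbsoluteDifferenceRecommenderEvaluator();
RecommenderBuilder bldr = new RecommenderBuilder() {
    @Override
    public Recommender buildRecommender(DataModel model) throws TasteException {
        // Use the Pearson Correlation to calculate similarity
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Generate neighborhoods of the 10 nearest users
        UserNeighborhood hood = new NearestUserNeighborhood(10, similarity, model);
        return new GenericUserBasedRecommender(model, hood, similarity);
    }
};
// Use 70% of the data to train the model and 30% to test;
// null selects the default DataModelBuilder
double score = eval.evaluate(bldr, null, model, 0.7, 1.0);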
Challenges
1. Context
2. Cold Start
3. Data Sparsity
4. Popularity Bias
5. Curse of Dimensionality
Context Challenges
???
January
20 degrees &
Snowing…..
Other Challenges
• Cold Start
  • Occurs when a new item or a new user is introduced
  • Can be handled by:
    • Substituting an average item/user profile
    • Using another recommendation technique (Content-Based)
• Data Sparsity
  • Too many items/users make finding intersections difficult
• Popularity Bias
  • Skewed towards popular items; people with “unique” taste are left out
• Curse of Dimensionality
  • More items/users lead to more noise and greater error
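One way to read the "average profile" remedy for cold start is to fall back on an item's mean rating when no personalized estimate exists. The sketch below is an assumption about how that could look, with a hypothetical class name `ColdStartFallback` and `NaN` used to signal a missing estimate.

```java
import java.util.Map;

// Hypothetical helper class for illustration only.
class ColdStartFallback {
    /** Mean rating an item received across all users, or NaN if nobody rated it. */
    static double averageItemRating(long itemId, Map<Long, Map<Long, Double>> ratingsByUser) {
        double total = 0.0;
        int count = 0;
        for (Map<Long, Double> ratings : ratingsByUser.values()) {
            Double rating = ratings.get(itemId);
            if (rating != null) {
                total += rating;
                count++;
            }
        }
        return count == 0 ? Double.NaN : total / count;
    }

    /** Uses the personalized estimate when available, otherwise the item average. */
    static double estimate(double personalized, long itemId,
                           Map<Long, Map<Long, Double>> ratingsByUser) {
        return Double.isNaN(personalized)
                ? averageItemRating(itemId, ratingsByUser)
                : personalized;
    }
}
```

This helps with new users; a brand-new item with no ratings still returns `NaN`, which is where a content-based technique would have to take over.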
Resources
Mahout in Action
Sean Owen, Robin Anil, Ted Dunning,
Ellen Friedman
Hadoop: The Definitive Guide
Tom White
Thank You!
@BluewaterSQL
http://bluewatersql.wordpress.com/
cprice@pragmaticworks.com
QUESTIONS???

Running with Elephants: Predictive Analytics with HDInsight

  • 1. Running with Elephants Predictive Analytics with Mahout & HDInsight
  • 2. Introduction Chris Price Senior BI Consultant with Pragmatic Works Author Regular Speaker Data Geek & Super Dad! @BluewaterSQL http://bluewatersql.wordpress.com/ cprice@pragmaticworks.com
  • 3. You are the demo…. SQL Brewhaus http://sqlbrewhaus.azurewebsites.net Create an Account… Rate some beers… Don’t worry your info will only be sold to the HIGHEST bidder
  • 4. Agenda • Business Case for Recommendations • How a Recommendation Engine Works • Recommendation Implementation & Integration • Evaluating Recommendations • Challenges of Implementing Recommendations
  • 5. Making the Business Case Objective Increase Revenue Increase # of Orders Increase Items per Order Increase Average Item Price Up-Sell Website Navigational Inefficiency Cross-Sell
  • 7. Recommendation Engines • Take observation data and use data mining/machine learning algorithms to predict outcomes • Assumptions: • People with similar interest have common preferences • Sufficiently large number of preferences available
  • 8. Recommendation Options • Collaborative Filtering (Mahout) • User-Based • Item-Based • Content-Based (Mahout Clustering) • Data Mining (SSAS) • Association • Clustering
  • 9. Technology • A scalable machine learning library • Fast, Efficient & Pragmatic • Many of the algorithms can be run on Hadoop HDInsight • Hadoop on Windows • HDInsight on Windows Azure (Seamlessly scale in the cloud) • HortonWorks Data Platform/HDP (On-Premise Solution)
  • 10. Generating Recommendations 1. Sources of Data 2. Clean & Prepare Data 3. Generate Recommendations • Build User/Item matrix • Calculate User Similarity • Form Neighborhoods • Generate Recommendations
  • 11. Sources of Data • Implicit • Ratings • Feedback • Demographics • Psychographics (Personality/Lifestyle/Attitude), • Ephemeral Need (Need for a moment) • Explicit • Purchase History • Click/Browse History • Product/Item • Taxonomy • Attributes • Descriptions Our focus for today
  • 12. Data Preparation • Clean-Up: • Remove Outliers (Z-Score) • Remove frequent buyers (Skew) • Normalize Data (Unity-Based) • Format Data into CSV input file: <User ID>, <Item ID>, <Rating>
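The clean-up steps on this slide can be sketched in plain Java. This is only an illustration under stated assumptions: the class and method names (`DataPrepSketch`, `removeOutliers`, `normalize`, `toCsvLine`) are hypothetical, the z-score uses the population standard deviation, and the z-score threshold is left to the caller because the slide does not give one. Removing frequent buyers is not covered here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class; none of these names come from Mahout or the slides.
final class DataPrepSketch {
    /** Keeps only values whose z-score magnitude is at most maxZ. */
    static List<Double> removeOutliers(List<Double> values, double maxZ) {
        double mean = values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double variance = values.stream()
                .mapToDouble(v -> (v - mean) * (v - mean)).average().orElse(0.0);
        double stdDev = Math.sqrt(variance); // population standard deviation
        List<Double> kept = new ArrayList<>();
        for (double v : values) {
            if (stdDev == 0.0 || Math.abs((v - mean) / stdDev) <= maxZ) {
                kept.add(v);
            }
        }
        return kept;
    }

    /** Unity-based (min-max) normalization: rescales x from [min, max] to [0, 1]. */
    static double normalize(double x, double min, double max) {
        return max == min ? 0.0 : (x - min) / (max - min);
    }

    /** Formats one preference as a "<User ID>,<Item ID>,<Rating>" input line. */
    static String toCsvLine(long userId, long itemId, double rating) {
        return String.format(Locale.ROOT, "%d,%d,%.2f", userId, itemId, rating);
    }

    public static void main(String[] args) {
        System.out.println(removeOutliers(Arrays.asList(1.0, 2.0, 3.0, 4.0, 100.0), 1.5));
        System.out.println(toCsvLine(7, 42, normalize(3.0, 1.0, 5.0)));
    }
}
```

The threshold of 1.5 in `main` is chosen only so the small sample shows an outlier being dropped; a real data set would need its own cut-off.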
  • 13. How it Works? • Build a User/Item Matrix [Figure: user/item matrix with users 1…N as rows and items 1…n as columns; a 1 marks each item a user has a preference for]
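Building the user/item matrix from the CSV input might look like the sketch below. It assumes the `<User ID>,<Item ID>,<Rating>` layout from the data preparation slide and stores the matrix sparsely as nested maps; `UserItemMatrixSketch` and `build` are hypothetical names, and Mahout's own `FileDataModel` does this work in practice.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a sparse user/item matrix as user -> (item -> rating).
final class UserItemMatrixSketch {
    /** Parses "<User ID>,<Item ID>,<Rating>" lines into nested maps. */
    static Map<Long, Map<Long, Double>> build(List<String> csvLines) {
        Map<Long, Map<Long, Double>> matrix = new TreeMap<>();
        for (String line : csvLines) {
            if (line.trim().isEmpty()) {
                continue; // ignore blank lines
            }
            String[] parts = line.split(",");
            long userId = Long.parseLong(parts[0].trim());
            long itemId = Long.parseLong(parts[1].trim());
            double rating = Double.parseDouble(parts[2].trim());
            matrix.computeIfAbsent(userId, k -> new TreeMap<>()).put(itemId, rating);
        }
        return matrix;
    }

    public static void main(String[] args) {
        System.out.println(build(Arrays.asList("1,10,4.0", "1,11,3.5", "2,10,5.0")));
    }
}
```

Only observed preferences are stored, so an empty cell in the matrix is simply a missing key.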
  • 15. Neighborhood Formation • Requires some experimentation • Similarity Metrics • Pearson Correlation • Euclidean Distance • Spearman Correlation • Cosine • Tanimoto Coefficient • Log-Likelihood
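Two of the listed metrics are sketched here to show how they differ: Pearson correlation works on the rating values of co-rated items, while the Tanimoto coefficient ignores rating values and compares only which items each user has touched. The class and method names are hypothetical; Mahout ships its own implementations (for example `PearsonCorrelationSimilarity`), whose edge-case handling may differ from this sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of two similarity metrics over sparse preference maps.
final class SimilaritySketch {
    /** Pearson correlation over items both users rated; NaN when undefined. */
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        Set<Long> common = new HashSet<>(a.keySet());
        common.retainAll(b.keySet());
        int n = common.size();
        if (n < 2) {
            return Double.NaN;
        }
        double sumA = 0, sumB = 0;
        for (Long item : common) {
            sumA += a.get(item);
            sumB += b.get(item);
        }
        double meanA = sumA / n, meanB = sumB / n;
        double cov = 0, varA = 0, varB = 0;
        for (Long item : common) {
            double da = a.get(item) - meanA;
            double db = b.get(item) - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        double denom = Math.sqrt(varA * varB);
        return denom == 0 ? Double.NaN : cov / denom;
    }

    /** Tanimoto coefficient: size of the intersection divided by size of the union. */
    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> inter = new HashSet<>(a);
        inter.retainAll(b);
        int union = a.size() + b.size() - inter.size();
        return union == 0 ? 0.0 : (double) inter.size() / union;
    }

    public static void main(String[] args) {
        Map<Long, Double> u1 = new HashMap<>();
        u1.put(1L, 1.0); u1.put(2L, 2.0); u1.put(3L, 3.0);
        Map<Long, Double> u2 = new HashMap<>();
        u2.put(1L, 2.0); u2.put(2L, 4.0); u2.put(3L, 6.0);
        System.out.println(pearson(u1, u2));
        System.out.println(tanimoto(u1.keySet(), u2.keySet()));
    }
}
```

Tanimoto suits data with no rating values (such as purchase or click history), which is one reason the choice of metric requires experimentation.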
  • 16. How it Works? • Find users similar to U5 • Use a similarity metric (kNN) • U1 & U7 are identified as most similar to U5 [Figure: user/item matrix, users as rows and items as columns]
  • 17. How it Works? • Generate Recommendations: • Find items that have not been reviewed (I1 and I6) • Predict rating by taking weighted sum [Figure: user/item matrix, users as rows and items as columns, with the fractional values 0.5 and 0.7 shown alongside the 1 entries]
  • 18. Pseudo-Code Implementation
      for each item i that u has no preference for
        for each user v (restricted to u's neighborhood) that has a preference for i
          compute similarity s between u and v
          calculate running average of v's preference for i, weighted by s
      return top ranked (weighted average) i
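The pseudo-code can be turned into a small runnable sketch. It assumes the neighborhood and its similarities have already been computed and are passed in as a map, it iterates over neighbors first rather than items first (the resulting weighted averages are the same), and it skips non-positive similarities, which is a simplification of this sketch rather than something the slide specifies. `UserBasedSketch` and `estimate` are hypothetical names; ranking the returned estimates is left to the caller.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of user-based estimation restricted to a neighborhood.
final class UserBasedSketch {
    /**
     * Estimates ratings for items the target user has no preference for.
     * prefs: user -> (item -> rating); neighborSims: neighbor -> similarity to target.
     */
    static Map<Long, Double> estimate(long target,
                                      Map<Long, Map<Long, Double>> prefs,
                                      Map<Long, Double> neighborSims) {
        Map<Long, Double> weightedSum = new HashMap<>();
        Map<Long, Double> simSum = new HashMap<>();
        Map<Long, Double> own = prefs.getOrDefault(target, new HashMap<>());
        for (Map.Entry<Long, Double> n : neighborSims.entrySet()) {
            double s = n.getValue();
            if (s <= 0) {
                continue; // assumption: ignore non-positive similarities
            }
            Map<Long, Double> theirs = prefs.getOrDefault(n.getKey(), new HashMap<>());
            for (Map.Entry<Long, Double> p : theirs.entrySet()) {
                if (own.containsKey(p.getKey())) {
                    continue; // target already has a preference for this item
                }
                weightedSum.merge(p.getKey(), s * p.getValue(), Double::sum);
                simSum.merge(p.getKey(), s, Double::sum);
            }
        }
        Map<Long, Double> estimates = new HashMap<>();
        for (Map.Entry<Long, Double> e : weightedSum.entrySet()) {
            estimates.put(e.getKey(), e.getValue() / simSum.get(e.getKey()));
        }
        return estimates;
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Double>> prefs = new HashMap<>();
        prefs.computeIfAbsent(5L, k -> new HashMap<>()).put(2L, 4.0);
        prefs.computeIfAbsent(1L, k -> new HashMap<>()).put(1L, 4.0);
        prefs.computeIfAbsent(7L, k -> new HashMap<>()).put(1L, 2.0);
        Map<Long, Double> sims = new HashMap<>();
        sims.put(1L, 0.75);
        sims.put(7L, 0.25);
        System.out.println(estimate(5L, prefs, sims));
    }
}
```

In the sample data, users 1 and 7 play the role of the neighbors of user 5; the ratings and similarity values are made up for illustration.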
  • 19. Mahout Implementation • Real-Time Recommendations • Write Java Code and host in JVM Instance • Limited scalability • Requires Training Data • Integration typically handled through web services • Batch-Based Recommendations • Uses MapReduce jobs on Hadoop • Offline, Slow, yet scalable • Out-of-the-box recommender jobs
  • 20. Mahout MapReduce Implementation 1 – Generate List of ItemIDs 2 – Create Preference Vector 3 – Count Unique Users 4 – Transpose Preference Vectors 5 – Row Similarity • Compute Weights • Compute Similarities • Similarity Matrix 6 – Pre-Partial Multiply, Similarity Matrix 7 – Pre-Partial Multiply, Preferences 8 – Partial Multiply (Steps 6 & 7) 9 – Filter Items 10 – Aggregate & Recommend
  • 21. Integrating Mahout • Real-Time • Requires Java coding • Web Service • Process: • Load training data (memory pressure) • Generate recommendations • Batch • ETL from source • Generate input file (UserID, ItemID, Rating) • Load to HDFS • Process with Mahout/Hadoop • ETL output from HDFS/Hadoop • Example output line: 7 [1:4.5,2:4.5,3:4.5,4:4.5,5:4.5,6:4.5,7:4.5] • Output format: UserID [ItemID:Estimated Rating, …]
  • 22. Handling Recommendations Storing Recommendations: • Hive • Data Warehouse system for Hadoop • Hive ODBC Driver • MongoDB • Leading NoSQL database • JSON-like storage with flexible schema • C#/.NET MongoDB Driver • HBase • Open-source distributed, column-oriented database modeled after Google's Bigtable • Use Pig/MapReduce to process output files and load HBase table • Java API for easy reading • Source System (SQL Server, etc.)
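When the batch output is pulled back out of HDFS, each line has to be parsed before it can be loaded elsewhere. The sketch below assumes lines shaped exactly like the example on this slide (a user ID, whitespace, then a bracketed list of `ItemID:Estimate` pairs); `BatchOutputSketch` and its methods are hypothetical names, and malformed lines are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for lines such as "7 [1:4.5,2:4.5,3:4.5]".
final class BatchOutputSketch {
    /** Extracts the user ID that precedes the bracketed list. */
    static long parseUserId(String line) {
        return Long.parseLong(line.substring(0, line.indexOf('[')).trim());
    }

    /** Parses the bracketed list into an item -> estimated rating map, keeping order. */
    static Map<Long, Double> parseItems(String line) {
        Map<Long, Double> items = new LinkedHashMap<>();
        int open = line.indexOf('[');
        int close = line.lastIndexOf(']');
        if (open < 0 || close < open) {
            return items; // no bracketed list found
        }
        String body = line.substring(open + 1, close).trim();
        if (body.isEmpty()) {
            return items;
        }
        for (String pair : body.split(",")) {
            String[] kv = pair.trim().split(":");
            items.put(Long.parseLong(kv[0].trim()), Double.parseDouble(kv[1].trim()));
        }
        return items;
    }

    public static void main(String[] args) {
        String line = "7 [1:4.5,2:4.5,3:4.5]";
        System.out.println(parseUserId(line) + " -> " + parseItems(line));
    }
}
```

A `LinkedHashMap` is used so the items keep the order in which they appear in the output line.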
  • 23. Evaluating the Recommendations • How good are your recommendations? • How do you evaluate the recommendation engine? • Two options; both split the data into test and training data sets: • Average Difference • Root-Mean-Square • How it works?
                             I1    I2    I3
      Estimated Review       3.5   4.0   1.5
      Actual Review          4.0   2.0   2.0
      Absolute Difference    0.5   2.0   0.5
      Average Difference = (0.5 + 2.0 + 0.5) / 3 = 1.0
      Root-Mean-Square = √((0.5² + 2.0² + 0.5²) / 3) = √1.5 ≈ 1.22
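The two scores can be checked with a few lines of Java. This sketch recomputes them for the three estimated and actual reviews from this slide; `EvalMetricsSketch` and its method names are hypothetical and are not part of Mahout.

```java
// Hypothetical sketch of the two evaluation scores from the slide.
final class EvalMetricsSketch {
    /** Mean of the absolute differences between estimated and actual ratings. */
    static double averageAbsoluteDifference(double[] estimated, double[] actual) {
        double sum = 0;
        for (int i = 0; i < estimated.length; i++) {
            sum += Math.abs(estimated[i] - actual[i]);
        }
        return sum / estimated.length;
    }

    /** Square root of the mean squared difference; penalizes large misses more. */
    static double rootMeanSquare(double[] estimated, double[] actual) {
        double sumSquares = 0;
        for (int i = 0; i < estimated.length; i++) {
            double diff = estimated[i] - actual[i];
            sumSquares += diff * diff;
        }
        return Math.sqrt(sumSquares / estimated.length);
    }

    public static void main(String[] args) {
        double[] estimated = {3.5, 4.0, 1.5};
        double[] actual = {4.0, 2.0, 2.0};
        System.out.println(averageAbsoluteDifference(estimated, actual));
        System.out.println(rootMeanSquare(estimated, actual));
    }
}
```

For this data the average difference is 1.0 and the root-mean-square is the square root of 1.5, which rounds to 1.22. The single large miss on I2 is what pushes the root-mean-square above the average difference.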
  • 24. Evaluating the Recommendations
      // Classes come from the Mahout Taste API (org.apache.mahout.cf.taste.*) and java.io.File
      DataModel model = new FileDataModel(new File("ratings.csv"));
      RecommenderEvaluator eval = new AverageAbsoluteDifferenceRecommenderEvaluator();
      RecommenderBuilder bldr = new RecommenderBuilder() {
          @Override
          public Recommender buildRecommender(DataModel model) throws TasteException {
              // Use the Pearson correlation to calculate similarity
              UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
              // Generate neighborhoods of the 10 nearest users
              UserNeighborhood hood = new NearestNUserNeighborhood(10, similarity, model);
              return new GenericUserBasedRecommender(model, hood, similarity);
          }
      };
      // Use 70% of the data to train the model and 30% to test;
      // null selects the default DataModelBuilder, 1.0 evaluates all users
      double score = eval.evaluate(bldr, null, model, 0.7, 1.0);
  • 25. Challenges 1. Context 2. Cold Start 3. Data Sparsity 4. Popularity Bias 5. Curse of Dimensionality
  • 27. Other Challenges • Cold Start • Occurs when either a new item or new user is introduced • Can be handled by: • Substituting an average item/user profile • Using another recommendation generation technique (Content-Based) • Data Sparsity • Too many items/user make finding intersections difficult • Popularity Bias • Skewed towards popular items; people with "unique" tastes are left out • Curse of Dimensionality • More items/user leads to more noise and greater error
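One way to read "substitute an average user profile" for the cold-start case is to give a brand-new user the mean rating of each item across all existing users until real preferences arrive. The sketch below shows that reading only; `ColdStartSketch` and `averageProfile` are hypothetical names and this is not presented as what Mahout does internally.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical cold-start fallback: an "average user" built from all known ratings.
final class ColdStartSketch {
    /** Returns item -> mean rating across every user who rated that item. */
    static Map<Long, Double> averageProfile(Map<Long, Map<Long, Double>> prefs) {
        Map<Long, Double> sum = new HashMap<>();
        Map<Long, Integer> count = new HashMap<>();
        for (Map<Long, Double> userPrefs : prefs.values()) {
            for (Map.Entry<Long, Double> p : userPrefs.entrySet()) {
                sum.merge(p.getKey(), p.getValue(), Double::sum);
                count.merge(p.getKey(), 1, Integer::sum);
            }
        }
        Map<Long, Double> average = new HashMap<>();
        for (Map.Entry<Long, Double> e : sum.entrySet()) {
            average.put(e.getKey(), e.getValue() / count.get(e.getKey()));
        }
        return average;
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Double>> prefs = new HashMap<>();
        prefs.computeIfAbsent(1L, k -> new HashMap<>()).put(10L, 4.0);
        prefs.get(1L).put(11L, 2.0);
        prefs.computeIfAbsent(2L, k -> new HashMap<>()).put(10L, 2.0);
        System.out.println(averageProfile(prefs));
    }
}
```

Because this fallback favors items that many users rated well, it inherits the popularity bias listed on the same slide.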
  • 28. Resources • Mahout in Action (Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman) • Hadoop: The Definitive Guide (Tom White)