Amazon and Twitter do it; Wal-Mart and Facebook do too. What about you? Big data predictive analytics is pervasive, and with HDInsight it has never been more approachable. In this session you become part of the demo as your clickstream data at our fictional e-commerce website drives user and product recommendations using the built-in Mahout (Taste) algorithms. This action-packed session covers real-world, practical solutions for moving data into and out of HDFS (with Sqoop), using MongoDB or HBase as a source/destination, and of course handling Mahout processing in distributed mode.
2. Introduction
Chris Price
Senior BI Consultant with Pragmatic Works
Author
Regular Speaker
Data Geek & Super Dad!
@BluewaterSQL
http://bluewatersql.wordpress.com/
cprice@pragmaticworks.com
3. You are the demo….
SQL Brewhaus
http://sqlbrewhaus.azurewebsites.net
Create an Account… Rate some beers…
Don’t worry, your info will only be sold to the HIGHEST bidder
4. Agenda
• Business Case for Recommendations
• How a Recommendation Engine Works
• Recommendation Implementation & Integration
• Evaluating Recommendations
• Challenges of Implementing Recommendations
5. Making the Business Case
[Objective diagram] Increase Revenue, by increasing the number of orders (reducing website navigational inefficiency), increasing items per order (cross-sell), and increasing the average item price (up-sell)
7. Recommendation Engines
• Take observational data and use data mining/machine learning algorithms to predict outcomes
• Assumptions:
• People with similar interests have common preferences
• A sufficiently large number of preferences is available
9. Technology
Mahout
• A scalable machine learning library
• Fast, Efficient & Pragmatic
• Many of the algorithms can be run on Hadoop
HDInsight
• Hadoop on Windows
• HDInsight on Windows Azure (Seamlessly scale in the cloud)
• Hortonworks Data Platform (HDP) (On-Premises Solution)
10. Generating Recommendations
1. Sources of Data
2. Clean & Prepare Data
3. Generate Recommendations
• Build User/Item matrix
• Calculate User Similarity
• Form Neighborhoods
• Generate Recommendations
11. Sources of Data
• Explicit
• Ratings
• Feedback
• Demographics
• Psychographics (Personality/Lifestyle/Attitude)
• Ephemeral Need (a need of the moment)
• Implicit
• Purchase History
• Click/Browse History
• Product/Item
• Taxonomy
• Attributes
• Descriptions
Our focus for today
12. Data Preparation
• Clean-Up:
• Remove Outliers (Z-Score)
• Remove frequent buyers (Skew)
• Normalize Data (Unity-Based)
• Format Data into CSV input file:
<User ID>, <Item ID>, <Rating>
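As a rough illustration, here is a minimal Java sketch of that clean-up pipeline; the class name, the sample data, and the 3-standard-deviation cutoff are all assumptions for illustration, not part of Mahout:

import java.util.List;

// Hypothetical sketch: clean raw ratings and emit Mahout's <UserID>,<ItemID>,<Rating> CSV
public class PrepRatings {
    record Rating(long userId, long itemId, double value) {}

    public static void main(String[] args) {
        List<Rating> raw = List.of(
                new Rating(1, 101, 4.5), new Rating(1, 102, 3.0),
                new Rating(2, 101, 5.0), new Rating(2, 103, 1.0));

        // Remove outliers: drop ratings more than 3 standard deviations from the mean (z-score)
        double mean = raw.stream().mapToDouble(Rating::value).average().orElse(0);
        double sd = Math.sqrt(raw.stream()
                .mapToDouble(r -> (r.value() - mean) * (r.value() - mean)).average().orElse(0));
        List<Rating> kept = raw.stream()
                .filter(r -> sd == 0 || Math.abs((r.value() - mean) / sd) <= 3)
                .toList();

        // Unity-based normalization: rescale ratings into [0, 1], then print CSV rows
        double min = kept.stream().mapToDouble(Rating::value).min().orElse(0);
        double max = kept.stream().mapToDouble(Rating::value).max().orElse(1);
        for (Rating r : kept) {
            double norm = (max == min) ? 0 : (r.value() - min) / (max - min);
            System.out.printf("%d,%d,%.3f%n", r.userId(), r.itemId(), norm);
        }
    }
}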
13. How It Works
• Build a user/item matrix: rows are users (1..N), columns are items (1..n), and a 1 marks each item a user has rated
[Matrix figure: a sparse grid of 1s showing which items each user has rated]
16. How It Works
• Find users similar to U5
• Use a similarity metric to score users, then keep the k nearest neighbors (kNN)
• U1 & U7 are identified as most similar to U5
[Matrix figure: U1's and U7's rows overlap most with U5's rated items]
17. How It Works
• Generate recommendations:
• Find items U5 has not yet reviewed (I1 and I6)
• Predict a rating for each by taking the similarity-weighted average of the neighbors' ratings: predicted(u, i) = Σ_v s(u, v) · r(v, i) / Σ_v |s(u, v)|, summed over neighbors v that rated i
[Matrix figure: predicted ratings (0.5 and 0.7) filled into the unreviewed cells for I1 and I6]
18. Pseudo-Code Implementation
for each item i that user u has no preference for:
  for each user v that has a preference for i:
    compute the similarity s between u and v
    fold v's preference for i, weighted by s, into a running average
return the top-ranked items i (by weighted average)
• To keep this tractable, restrict the inner loop to users v in u's neighborhood
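Mahout's Taste API implements exactly this loop; here is a minimal sketch of the user-based recommender (the file name, neighborhood size, and user ID are assumptions for illustration):

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserRecommendations {
    public static void main(String[] args) throws Exception {
        // UserID,ItemID,Rating triples, as prepared on the Data Preparation slide
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Restrict the search to a neighborhood of the 10 most similar users
        UserNeighborhood hood = new NearestNUserNeighborhood(10, similarity, model);
        GenericUserBasedRecommender rec =
                new GenericUserBasedRecommender(model, hood, similarity);
        // Top 5 recommendations for user 5
        List<RecommendedItem> items = rec.recommend(5, 5);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}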
19. Mahout Implementation
• Real-Time Recommendations
• Write Java code and host it in a JVM instance
• Limited scalability
• Requires training data loaded in memory
• Integration typically handled through web services
• Batch-Based Recommendations
• Uses MapReduce jobs on Hadoop
• Offline and slower, but scalable
• Out-of-the-box recommender jobs
21. Integrating Mahout
• Real-Time
• Requires Java coding
• Web Service
• Process:
• Load training data (memory pressure)
• Generate recommendations
• Batch
• ETL from source
• Generate input file (UserID, ItemID, Rating)
• Load to HDFS
• Process with Mahout/Hadoop
• ETL output from HDFS/Hadoop
• Example output: 7 [1:4.5,2:4.5,3:4.5,4:4.5,5:4.5,6:4.5,7:4.5]
• Format: UserID [ItemID:Estimated Rating, …]
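For the batch path, Mahout ships an out-of-the-box item-based RecommenderJob that produces output in exactly that format; a minimal sketch of driving it from Java (the HDFS paths and parameter values are assumptions):

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class BatchRecommendations {
    public static void main(String[] args) throws Exception {
        // Runs as a series of MapReduce jobs on the Hadoop/HDInsight cluster
        ToolRunner.run(new RecommenderJob(), new String[] {
                "--input", "/data/ratings.csv",       // UserID,ItemID,Rating file on HDFS
                "--output", "/data/recommendations",  // one "UserID [ItemID:Rating,...]" line per user
                "--similarityClassname", "SIMILARITY_PEARSON_CORRELATION",
                "--numRecommendations", "10"          // top-N items per user
        });
    }
}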
22. Handling Recommendations
Storing Recommendations:
• Hive
• Data Warehouse system for Hadoop
• Hive ODBC Driver
• MongoDB
• A leading NoSQL database
• JSON-like storage with a flexible schema
• C#/.NET MongoDB driver
• HBase
• An open-source, distributed, column-oriented database modeled after Google’s BigTable
• Use Pig/MapReduce to process output files and load an HBase table
• Java API for easy reading
• Source System (SQL Server, etc.)
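Whichever store you choose, the first step is parsing the batch output; here is a minimal sketch that turns one output line into (user, item, score) rows ready to load into Hive, MongoDB, or HBase (the class and record names are hypothetical):

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: parse one line of RecommenderJob output,
// e.g. "7 [1:4.5,2:4.5,3:4.5]", into (userId, itemId, score) triples
public class OutputParser {
    record Recommendation(long userId, long itemId, double score) {}

    static List<Recommendation> parse(String line) {
        List<Recommendation> result = new ArrayList<>();
        String[] parts = line.trim().split("\\s+", 2);              // "7" and "[1:4.5,...]"
        long userId = Long.parseLong(parts[0]);
        String body = parts[1].substring(1, parts[1].length() - 1); // strip [ and ]
        for (String pair : body.split(",")) {
            String[] kv = pair.split(":");
            result.add(new Recommendation(userId,
                    Long.parseLong(kv[0]), Double.parseDouble(kv[1])));
        }
        return result;
    }

    public static void main(String[] args) {
        parse("7 [1:4.5,2:4.5,3:4.5]").forEach(System.out::println);
    }
}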
23. Evaluating the Recommendations
• How good are your recommendations?
• How do you evaluate the recommendation engine?
• Two options; both split the data into training and test sets:
• Average Difference
• Root-Mean-Square
• How it works:
                     I1    I2    I3
Estimated Review     3.5   4.0   1.5
Actual Review        4.0   2.0   2.0
Absolute Difference  0.5   2.0   0.5
Average Difference = (0.5 + 2.0 + 0.5) / 3 = 1.0
Root-Mean-Square = √((0.5² + 2.0² + 0.5²) / 3) ≈ 1.22
24. Evaluating the Recommendations
import java.io.File;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

DataModel model = new FileDataModel(new File("ratings.csv"));
RecommenderEvaluator eval = new AverageAbsoluteDifferenceRecommenderEvaluator();
RecommenderBuilder bldr = new RecommenderBuilder() {
  @Override
  public Recommender buildRecommender(DataModel model) throws TasteException {
    // Use the Pearson correlation to calculate similarity
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Form a neighborhood of the 10 nearest users
    UserNeighborhood hood = new NearestNUserNeighborhood(10, similarity, model);
    return new GenericUserBasedRecommender(model, hood, similarity);
  }
};
// Use 70% of the data to train the model and 30% to test
// (the null argument is the optional DataModelBuilder)
double score = eval.evaluate(bldr, null, model, 0.7, 1.0);
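For the average-absolute-difference evaluator a lower score is better: a score of 1.0 means the predicted ratings miss the actual ratings by a full point on average.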
27. Other Challenges
• Cold Start
• Occurs when either a new item or new user is introduced
• Can be handled by:
• Substituting an average item/user profile
• Falling back to another recommendation technique (content-based)
• Data Sparsity
• Users rate only a small share of a large catalog, so finding overlapping preferences between users is difficult
• Popularity Bias
• Recommendations skew toward popular items; people with “unique” tastes are left out
• Curse of Dimensionality
• More items and users lead to more noise and greater error