Valencian Summer School 2015
Day 1
Lecture 9
Real World Machine Learning - Cooking Predictions
Andrés González (CleverTask)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
1. Cooking Predictions
A real case in the hotel sector
Andrés González
Big Data Prediction Manager
andresg@clevertask.com
Twitter: @data_lytics
2. CleverTask Solutions SL - Big Data Business Unit
Agenda
  1. Business Need
  2. “Cooking” Predictions
  3. Gathering ingredients
  4. Cleaning and Transforming
  5. The recipe (the model)
  6. Tasting the dish
3. Hotel Sector
• Room occupancy (%).
• Cancellation risk.
• Income.
4. Business Need
Predict the client’s NATIONALITY BEFORE the client’s check-in
11. Machine Learning basics
Can you find patterns in this data?
12. Machine Learning basics
Historical Data -> Training -> Prediction
New Data -> Re-Training
14. “Cooking” Predictions
  1. Go to the market to buy ingredients
  2. Cleaning
  3. Transforming
  4. Cooking
  5. Tasting the Dish
15. “Cooking” Predictions
  1. Gathering RAW data
  2. Cleaning Data
  3. Transforming and Feature Engineering
  4. Training the Model
  5. Evaluating Prediction Quality
17. Where does the data come from?
• Own website
• Partner websites
RAW Data
18. RAW Data
One year of historical reservation data (.xlsx file)
Characteristics
• 260,000 reservations
• 80 fields
• 57 categorical
• 9 numeric
• 10 date
• 3 text
• 1 incorrect field
• Size: 150 MB
21. The Process
  1. Gathering Data: “Dirty” RAW Data
  2. Cleaning: “Clean” Data
  3. Transformation and Feature Engineering: New Fields, Calculated Fields
  4. Modeling
28. Data Cleaning
Row Deletion
• Reservations without check-in
• Cancelled reservations
• Rows with errors
Column Deletion
• IDs vs names
• Columns with little data
Other Actions
• Give dates a uniform format
• Remove accents
• Convert .xlsx -> .csv
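The row-level cleaning actions above can be sketched in a few lines of Python. This is only an illustration, not the code used in the project: the field names (`checkin_date`, `status`, `hotel_name`), the `CANCELLED` status value and the `DD/MM/YYYY` input date format are all assumptions, and the rows are assumed to have been exported from .xlsx to .csv already.

```python
import unicodedata
from datetime import datetime


def strip_accents(text):
    """Remove accents, e.g. 'Málaga' -> 'Malaga'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_rows(rows, date_field="checkin_date", status_field="status"):
    """Yield cleaned copies of reservation rows (dicts), skipping bad ones."""
    for row in rows:
        raw_date = (row.get(date_field) or "").strip()
        if not raw_date:
            continue  # reservation without check-in
        if row.get(status_field) == "CANCELLED":
            continue  # cancelled reservation
        try:
            checkin = datetime.strptime(raw_date, "%d/%m/%Y")
        except ValueError:
            continue  # row with errors (here: an impossible date)
        cleaned = {key: strip_accents(value) for key, value in row.items()}
        cleaned[date_field] = checkin.strftime("%Y-%m-%d")  # uniform date format
        yield cleaned


# Made-up sample rows, one per cleaning rule.
rows = [
    {"checkin_date": "03/07/2014", "status": "OK", "hotel_name": "Málaga Playa"},
    {"checkin_date": "", "status": "OK", "hotel_name": "Hotel A"},
    {"checkin_date": "05/07/2014", "status": "CANCELLED", "hotel_name": "Hotel B"},
    {"checkin_date": "31/02/2014", "status": "OK", "hotel_name": "Hotel C"},
]
cleaned_rows = list(clean_rows(rows))
```

Only the first sample row survives; the other three are dropped by the three row-deletion rules.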
29. Clean Dataset
Dirty
• 260,000 reservations
• 80 fields
• 57 categorical
• 9 numeric
• 10 date
• 3 text
• 1 incorrect field
• Size: 150 MB
Clean
• 150,000 reservations
• 46 fields
• 26 categorical
• 9 numeric
• 10 date
• 1 text
• Size: 75 MB
31. Transformations
Country Grouping
• Many countries to predict (210)
• Some countries have very few instances
• Grouping objective: min. 1% of total instances per group
• Does not affect the business objective
• Total number of groups: 20
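The 1% threshold can be illustrated with a short sketch. The slides do not say how the 20 groups were built (they may well be regional), so the single catch-all label `OTHER` used here is an assumption, as are the toy country codes and counts.

```python
from collections import Counter


def group_rare_countries(countries, min_share=0.01, other_label="OTHER"):
    """Map every country below `min_share` of all instances to `other_label`."""
    counts = Counter(countries)
    total = len(countries)
    keep = {c for c, n in counts.items() if n / total >= min_share}
    return [c if c in keep else other_label for c in countries]


# Toy data: 1,000 reservations, two of them from rare countries.
countries = ["ES"] * 600 + ["GB"] * 300 + ["DE"] * 98 + ["IS", "NZ"]
grouped = group_rare_countries(countries)
group_counts = Counter(grouped)
```

Note that the merged group can itself stay below the threshold, as it does in this toy data; in that case the rare countries would have to be merged into larger groups instead.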
New Fields
• RESERV_ANTICIPATION (calculated): (reservation date - check-in date)
• COUNTRY_HOTEL (name of the country)
• HOTEL_STARS (1-5)
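A minimal sketch of how the three new fields might be attached to a reservation row. The lookup table, hotel IDs, input field names and ISO date format are hypothetical. The sign convention is also an assumption: the sketch counts the days from reservation to check-in, so that a booking made in advance gives a positive number.

```python
from datetime import date

# Hypothetical lookup table; hotel IDs, countries and stars are made up.
HOTELS = {
    "H001": {"COUNTRY_HOTEL": "Spain", "HOTEL_STARS": 4},
    "H002": {"COUNTRY_HOTEL": "Portugal", "HOTEL_STARS": 3},
}


def add_new_fields(row):
    """Return a copy of `row` with the three new fields added."""
    reserved = date.fromisoformat(row["reservation_date"])
    checkin = date.fromisoformat(row["checkin_date"])
    enriched = dict(row)
    # Assumed sign convention: positive number of days booked in advance.
    enriched["RESERV_ANTICIPATION"] = (checkin - reserved).days
    enriched.update(HOTELS[row["hotel_id"]])
    return enriched


row = {"hotel_id": "H001", "reservation_date": "2014-05-01", "checkin_date": "2014-07-03"}
enriched = add_new_fields(row)
```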
32. Clean Dataset
Dirty
• 260,000 reservations
• 80 fields
• Size: 150 MB
Clean
• 150,000 reservations
• 46 fields
• Size: 75 MB
Transformed
• 150,000 records
• 49 fields
• Size: 80 MB
33. What is Feature Engineering?
Extract signal from noise
34. Feature Engineering Techniques
• Detect fields (features) that are predictors (signal) and bypass those that are not (noise)
• Dependent fields (pax, days, pax*days)
• Needless fields (reservation number)
• Fields with very little data
• Random fields (minute and second of reservation)
• Domain knowledge
• Experience
• Recursive cycle
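Some of the noise checks listed above can be automated with simple heuristics. The sketch below is an illustration under assumptions, not the author's procedure: the 50% fill threshold, the "unique per row" test for ID-like fields and all field names are hypothetical, and a flagged field still needs a domain-knowledge review before it is dropped.

```python
def screen_fields(rows, min_fill=0.5):
    """Return {field: reason} for fields that look like noise."""
    n = len(rows)
    flagged = {}
    for field in rows[0]:
        values = [r[field] for r in rows]
        filled = [v for v in values if v not in ("", None)]
        if len(filled) / n < min_fill:
            flagged[field] = "very little data"
        elif len(set(filled)) == n:
            # Heuristic only: a continuous numeric field can also be unique.
            flagged[field] = "unique per row (ID-like)"
    return flagged


def is_product_of(rows, target, a, b):
    """True if `target` is fully determined as `a * b` (a dependent field)."""
    return all(r[target] == r[a] * r[b] for r in rows)


# Made-up sample rows.
rows = [
    {"reservation_number": "R1", "pax": 2, "days": 3, "pax_days": 6, "promo_code": ""},
    {"reservation_number": "R2", "pax": 2, "days": 7, "pax_days": 14, "promo_code": ""},
    {"reservation_number": "R3", "pax": 1, "days": 3, "pax_days": 3, "promo_code": "X"},
    {"reservation_number": "R4", "pax": 2, "days": 3, "pax_days": 6, "promo_code": ""},
]
flagged = screen_fields(rows)
dependent = is_product_of(rows, "pax_days", "pax", "days")
```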
35. Recursive Feature Engineering
Field Selection -> Algorithm Adjustment -> Prediction Quality Evaluation -> (back to Field Selection)
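One common way to run such a select, adjust and evaluate cycle is greedy backward elimination, sketched below. It is named here as one possible technique, not as the method used in the project. `evaluate` is a hypothetical callable that trains a model on the given fields and returns a quality score; the toy scoring function and field names are made up for the example.

```python
def backward_selection(fields, evaluate):
    """Greedily drop fields while the evaluation score does not get worse.

    `evaluate` is a hypothetical callable: it takes a list of fields, trains
    a model on them and returns a quality score (higher is better).
    """
    selected = list(fields)
    best = evaluate(selected)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for field in list(selected):
            candidate = [f for f in selected if f != field]
            score = evaluate(candidate)
            if score >= best:
                selected, best, improved = candidate, score, True
                break  # restart the cycle with the reduced field list
    return selected, best


def toy_evaluate(candidate):
    """Stand-in for train + evaluate: signal fields add score, noise costs a bit."""
    useful = {"hotel_country": 0.30, "reserv_anticipation": 0.20}
    return round(sum(useful.get(f, -0.01) for f in candidate), 2)


selected, best = backward_selection(
    ["hotel_country", "reservation_number", "reserv_anticipation", "reservation_second"],
    toy_evaluate,
)
```

With the toy scores, the two noise fields are dropped one cycle at a time and the two signal fields remain.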
36. Clean Dataset
Dirty
• 260,000 reservations
• 80 fields
• Size: 150 MB
Clean
• 150,000 reservations
• 46 fields
• Size: 75 MB
Transformed
• 150,000 records
• 49 fields
• Size: 80 MB
Final Dataset
• 150,000 records
• 10 fields
• Size: 55 MB
38. The Process
  1. Gathering Data: “Dirty” RAW Data
  2. Cleaning: “Clean” Data
  3. Transformation and Feature Engineering: New Fields, Calculated Fields
  4. Modeling
42. Quality Evaluation
Dataset (100%)
• Training (80%) -> Model
• Test (20%) -> Evaluation
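An 80/20 split like the one on this slide can be sketched as follows. The random shuffle and the fixed seed are assumptions; the slides do not say how the rows were assigned to each part.

```python
import random


def train_test_split(rows, test_share=0.2, seed=42):
    """Shuffle a copy of `rows` and split it into (training, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


reservations = list(range(1000))  # stand-in for 1,000 reservation rows
training, test = train_test_split(reservations)
```

The model is trained only on `training`; `test` is held back so the evaluation uses rows the model has never seen.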
43. Quality Evaluation
• Accuracy
• Confusion Matrix
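Both measures are easy to compute by hand, which helps when reading them. In this sketch the country codes and predictions are made-up toy values.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Share of instances whose predicted class equals the actual class."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs; pairs with equal labels are correct."""
    return Counter(zip(actual, predicted))


actual = ["ES", "ES", "GB", "DE", "GB", "ES"]
predicted = ["ES", "GB", "GB", "DE", "ES", "ES"]
acc = accuracy(actual, predicted)
matrix = confusion_matrix(actual, predicted)
```

Accuracy is a single number, while the confusion matrix shows which nationalities get mistaken for which: here one `ES` client was predicted as `GB` and one `GB` client as `ES`.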
45. Quality Evaluation
Predicted vs Real Distribution
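Comparing the two distributions means comparing, class by class, the share of predictions with the share of real cases. A minimal sketch with made-up labels:

```python
from collections import Counter


def class_shares(labels):
    """Return {class: share of instances} for a list of class labels."""
    total = len(labels)
    return {cls: n / total for cls, n in Counter(labels).items()}


def distribution_gap(actual, predicted):
    """Per class: predicted share minus real share (positive = over-predicted)."""
    real, pred = class_shares(actual), class_shares(predicted)
    return {cls: pred.get(cls, 0.0) - real.get(cls, 0.0)
            for cls in sorted(set(real) | set(pred))}


actual = ["ES", "ES", "GB", "DE", "GB", "ES"]
predicted = ["ES", "ES", "GB", "DE", "ES", "ES"]
gap = distribution_gap(actual, predicted)
```

In this toy example `ES` is over-predicted and `GB` under-predicted by the same amount, while `DE` matches its real share.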
46. Cooking Predictions
80% / 20%
  1. Go to the market to buy ingredients
  2. Cleaning
  3. Transforming
  4. Cooking
  5. Tasting the Dish
47. Cooking Predictions
80% / 20%
  1. Gathering RAW data
  2. Cleaning Data
  3. Transforming and Feature Engineering
  4. Training the Model
  5. Evaluating Prediction Quality
48. Other Techniques
• Ensembles
• Clusters
• Weight Analysis
• Anomaly Detection
49. END
email: andresg@clevertask.com
Twitter: @data_lytics
www.clevertask.com