From Labelling Open Data Images to
Building a Private Recommender System
A transfer learning application
Recommender systems are paramount for e-business companies. There is an increasing need to take all available user information into account to tailor the best product proposition. One such piece of information is the content that the user actually sees: the image of the product.

When it comes to hostels, some people are more attracted by pictures of the room, others by the building or even the nearby beach.

In this talk, we will describe how we improved the recommender system of an e-business vacation retailer using the content of images. We’ll explain how to leverage open datasets and pre-trained deep learning models to derive user taste information. This transfer learning approach enables companies to use state-of-the-art machine learning methods without having deep learning expertise.

Published in: Data & Analytics

  1. From Labelling Open Data Images to Building a Private Recommender System: A transfer learning application
  2. Outline
     •  Introduction
     •  Iterative building of a recommender system
     •  Labelling images, AKA: Pragmatic Deep Learning for “Dummies”
     •  Post-processing, AKA: Using image information for BI on steroids
     •  Results & Conclusion
  3. Dataiku
     •  Founded in 2013
     •  60+ employees
     •  Paris, New York, London, San Francisco
     Data science software editor of Dataiku DSS
     DESIGN
     •  PREPARE: Load and prepare your data
     •  MODEL: Build your models
     •  ANALYSE: Visualize and share your work
     PRODUCTION
     •  AUTOMATE: Re-execute your workflow at ease
     •  MONITOR: Follow your production environment
     •  SCORE: Get predictions in real time
  4. Key Figures
     •  E-business vacation retailer
     •  Founded in 2006; 500M revenue in 2015
     •  18 million clients
     •  Hundreds of sales every day -> recommendation engine
     •  The sale image is paramount
  5. VPG specificities
     •  Sales are very temporary
     -> Unlike Amazon / Price Minister / Cdiscount
     -> Some classical recommender systems fail
     -> Sales are event-linked (Christmas, ski, summer)
     •  Expensive product
     -> Few recurrent buyers
     -> Appearance counts a lot
     •  Few recurrent buyers
     -> Classical approaches fail
     -> Less signal; visit information is paramount
     -> Users are less inclined to browse a lot (first 4-10 sales)
  6. A data science workflow: six steps to a predictive model
     Diagram labels: Business Understanding, Data Acquisition, Data Exploration & Understanding, Data Preparation, Model Creation, Evaluation, Deployment; Dataset 1, Dataset 2, Dataset n; Iteration 1, Iteration 2, Iteration n; Scored dataset
     Creating a predictive model is a highly iterative process. Data Science Studio enables its users to create and manage these projects end to end. This process is not industry-specific and can be applied to many use cases.
     Adapted from the CRISP-DM methodology
  8. Iterative Building of a Recommender System
  9. Basic Recommendation Engines
  10. Other Factors
  11. One Meta Model to Rule Them All
      •  Recommenders as features
      •  Machine learning to optimize purchasing probability
      Diagram labels: Combine, Recommend, Describe
  12. One Meta Model to Rule Them All
      •  Negative sampling
      •  Take all purchase tuples: (user, product, timestamp) -> 1
      •  Select 5 sales open on the same date that the user did not buy -> 0
      •  The model directly optimizes purchasing probability
      •  Machine learning model
      •  Features: recommender systems
      •  Logistic regression. Regularizing effect: we don’t want to overfit leaks
      •  Reranking approach, similar to Google or Yandex (Kaggle challenge)
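The negative sampling step on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_training_rows`, the `open_sales_by_date` mapping and the toy data are assumptions, and "did not buy" is interpreted here as "never bought by that user".

```python
import random
from datetime import date


def build_training_rows(purchases, open_sales_by_date, n_negatives=5, seed=0):
    """Return (user, sale, day, label) rows.

    Each purchase yields one positive row (label 1) and up to
    `n_negatives` sampled sales that were open the same day and that
    the user did not buy (label 0).
    """
    rng = random.Random(seed)

    # Everything each user bought, so it is never sampled as a negative.
    bought_by_user = {}
    for user, sale, _day in purchases:
        bought_by_user.setdefault(user, set()).add(sale)

    rows = []
    for user, sale, day in purchases:
        rows.append((user, sale, day, 1))
        candidates = sorted(open_sales_by_date[day] - bought_by_user[user])
        k = min(n_negatives, len(candidates))
        for negative in rng.sample(candidates, k):
            rows.append((user, negative, day, 0))
    return rows


# Hypothetical toy data: two purchases on a day with seven open sales.
day = date(2016, 7, 1)
purchases = [("u1", "s1", day), ("u2", "s3", day)]
open_sales_by_date = {day: {"s1", "s2", "s3", "s4", "s5", "s6", "s7"}}
rows = build_training_rows(purchases, open_sales_by_date)
```

The resulting rows would then be joined with the recommender scores as features and fed to a regularized logistic regression, as the slide describes.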
  13. One Meta Model to Rule Them All
      •  Going further?
      •  Predict the visit?
      -  Would make it possible to take more information into account
      -  Many people browse randomly
      •  Learning to rank on target: 2 bought, 1 visited, 0 otherwise
      •  Impact of this on the top 10 sales?
      •  Limitations:
      •  Highly dependent on the ranking displayed, which we don’t have; may overfit old man-made rules
  14. Recommender system for Home Page Ordering
      •  Inputs: customer visits, purchases, sales images, sales information
      •  Cleaning, combining and enrichment of data: the application automatically runs and compiles heterogeneous data
      •  Recommendation engines: generation of recommendations based on user behaviour
      •  Meta model: the meta model combines recommendations to directly optimize purchasing probability
      •  Optimization of home display: every customer is shown the 10 sales they are most likely to buy
      •  Batch scoring every night
      •  +7% revenue (A/B testing)
  15. Why use images?
      We want to distinguish: « Sun and Beach », « Ski »
      A picture is worth a thousand words
  16. Integrating Image Information
      •  Sales images
      •  Labelling model
      •  Example labels: Pool + Palm Trees Hotel + Mountains; Pool + Forest + Hotel + Sea; Sea + Beach + Forest + Hotel
      •  Sales descriptions
      •  CONTENT BASED recommender system
  17. Image Labelling For Recommendation Engine: Pragmatic Deep Learning for “Dummies”
  18. Using Deep Learning models: Common Issues
      •  “I don’t have a GPU server”
      •  “I don’t have a deep learning expert”
      •  “I don’t have labelled data” (or too few)
      •  “I don’t have the time to wait for model training”
      •  “I don’t want to pay for private APIs” / “I’m afraid their labelling will change over time”
  19. Pragmatic Deep Learning Cheat Sheet
      Flowchart nodes (edges are labelled Y / N):
      •  Do you have labels?
      •  Many?
      •  Are you sure?
      •  Train DL model
      •  Transfer Learning
      •  Is there a similar database?
      •  Is there a pre-trained model?
      •  Create your own
      •  Use it!
  20. Solution 1: Pre-trained models
      “I don’t have (or have few) labelled data” -> Is there similar data?
      •  PLACES DATABASE: 205 categories, 2.5 M images
      •  SUN DATABASE: 307 categories, 110 K images
      •  VPG
  21. Solution 1: Pre-trained models
      If there is open data, there is an open pre-trained model!
      •  Kudos to the community
      •  Check the licensing
      Example with Places (Caffe Model Zoo):
      •  tower: 0.53, skyscraper: 0.26
      •  swimming_pool/outdoor: 0.65, inn/outdoor: 0.06
  22. Solution 2: Transfer Learning
      “I want to add information from the SUN database”
      “But I have only 100 K images”
      If you know how to recognize… after a little bit of training… you will be able to recognize
      Transfer Learning: use a network that knows how to see
      •  As a feature generator / transformer
      •  To be updated for the new problem
  23. Solution 2: Transfer Learning
      Not limited to images!
      Pan, Sinno Jialin, and Qiang Yang. "A survey on transfer learning." IEEE Transactions on Knowledge and Data Engineering 22.10 (2010): 1345-1359.
      Word2Vec: use large text corpora
      •  For grammar learning
      •  For synonym learning
      If you know the sentiment for:
      •  “This wine tastes great” -> 1
      •  “The most disgusting cheese ever” -> 0
      And you know synonyms and grammar (word2vec), it’s easy to classify:
      •  “This cheese tasted awful”
      •  “The best wine in town”
  24. Solution 2: Transfer Learning
      Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson, http://cs231n.stanford.edu/slides/winter1516_lecture11.pdf
  25. Solution 2: Transfer Learning
      •  Few labeled data, similar data: use the network as a transformer, create a simple model
      •  Few labeled data, not so similar data: troubles. Simple model on shallow layers? Or get other data
      •  Lots of labeled data, similar data: fine-tune several layers, depending on the size of the data
      •  Lots of labeled data, not so similar data: retrain a new network with an existing architecture
      SUN vs Places dataset :)
      VPG:
      •  No labeled data
      •  Similar data
      ?
      Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson, http://cs231n.stanford.edu/slides/winter1516_lecture11.pdf
  26. Solution 2: Transfer Learning. Leverage existing knowledge!
      •  Datasets: PLACES DATABASE, SUN DATABASE, VOYAGE PRIVE
      •  Training (optional): pre-trained model, VGG16, from the Caffe Model Zoo (example output: tower: 0.53, skyscraper: 0.26)
      •  Transferred data: last convolutional layer features
      •  Re-training: re-trained model, 2 fully connected layers, TensorFlow
      •  Hardware labels on the diagram: GPU, CPU, GPU
      •  Accuracy: 72%, top-5 accuracy: 90% > state of the art on the dataset alone
  27. Solution 3: Generating your own large (or not) dataset
      •  Create a label set
      •  Easy: Man vs Woman?
      •  Harder: all relevant information in my images
      •  Manually select all words in a corpus (e.g. WordNet)
      •  Use search engines
      •  Augment search terms
      •  Get URLs and images from search terms
      •  Deduplicate
      •  Validate with Mechanical Turk
      •  Exclude incorrect images
      •  Evaluate human performance
  28. Solution 4: What about APIs?
  29. Solution 4: What about APIs?
      •  Price
      •  Their cost is often rather cheap, e.g. 100 K requests for less than $300
      •  vs. the cost of redeveloping (probably not as well)
      •  Full database scoring
      •  APIs are often limited in queries per month
      •  Make sure you can avoid the cold start problem
      •  Stability
      •  Use model versioning
      •  Avoid covariate shift, distribution drift
  30. What about APIs? Use them for generating labels! (Or don’t, it’s illegal)
      •  How to:
      •  Score part of the database for training
      •  Train a model
      •  Score your entire database
      •  But I have only 5000 requests? -> Use Transfer Learning!
      •  Stealing models: Tramèr, Florian, et al. "Stealing Machine Learning Models via Prediction APIs." arXiv preprint arXiv:1609.02943 (2016).
  31. What about APIs? Use them for generating labels! (Or don’t, it’s illegal)
      Experiment (demo, not used in any real project):
      •  5000 requests on the API
      -> 4500 for training
      -> 500 for validation
      •  Transfer learning with the MIT Places pre-trained model
      •  scikit-learn multilabel model
      •  One-vs-the-rest
      •  Untuned logistic regression
  32. What about APIs? Results (demo, not used in any real project)
      •  Accuracy: 95
      •  Recall: 80
      •  Precision: 75
      Example 1 (label: probability):
      •  landscape: 1.0000, sky: 1.0000, outdoors: 1.0000, nature: 1.0000, rock: 1.0000, travel: 1.0000
      •  sunset: 0.9998, no person: 0.9996, water: 0.9990, park: 0.9849, river: 0.9678, scenic: 0.8031
      Example 2 (label: probability):
      •  beach: 1.0000, summer: 1.0000, sand: 1.0000, tropical: 1.0000, travel: 1.0000, seascape: 1.0000
      •  ocean: 1.0000, relaxation: 1.0000, island: 1.0000, idyllic: 1.0000, seashore: 0.9998, water: 0.9997
  33. Post Treatment (or how we transfer the labelling information): Using image information for BI on steroids
  34. Labels post-processing
      Classification problem
      •  We only have the probabilities of each class
      •  Selecting based on a probability threshold fails
      •  Keeping all information is not sparse
      -> We keep 5 labels and probabilities per image
      Voyage Privé images, Deep/Transfer Learning models, 5-10 tags per image
      •  2 s/image with CPU
      •  x20 speed-up with GPU
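Keeping only the 5 most probable labels per image, as this slide describes, can be sketched in a few lines. The helper name `top_k_labels` is an assumption, and apart from the two scores quoted on the Places example slide (`swimming_pool/outdoor: 0.65`, `inn/outdoor: 0.06`) the probabilities below are made up for illustration.

```python
import heapq


def top_k_labels(probabilities, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    return heapq.nlargest(k, probabilities.items(), key=lambda item: item[1])


# Hypothetical classifier output for one image.
scores = {
    "swimming_pool/outdoor": 0.65,
    "inn/outdoor": 0.06,
    "hotel/outdoor": 0.12,   # assumed value
    "beach": 0.08,           # assumed value
    "coast": 0.05,           # assumed value
    "tower": 0.03,           # assumed value
    "skyscraper": 0.01,      # assumed value
}
top5 = top_k_labels(scores)
```

Unlike a fixed probability threshold, this always yields the same number of labels per image, which keeps the image x tag matrix sparse and evenly filled.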
  35. Labels post-processing
      Issue with our approach: complementary information vs. redundant information
      Solution: Matrix Factorization
  36. Topic extraction with Non-Negative Matrix Factorization
      •  Non-Negative Matrix Factorization (NMF): X = WH (scikit-learn implementation)
      •  X: image x tag, non-negative
      •  W: image x theme
      •  H: theme x tag
      •  Most represented themes
      •  Swimming-pool_Apartment_Putting-green
      •  Ocean_Coast_SandBar
      •  Coast_SeaCliff_RockArch
      •  Beach_Coast_BoardWalk
      •  Bridge_Viaduc_River
      •  Palace_BuildingFacade-Mansion
      •  Castle_Mansion_Monastery
      •  HotelRoom_Bedroom_DormRoom
      •  Dimension reduction: 200x200 pixels -> 600 tags => 30 themes
      •  An image is often a sparse combination of themes: faster content-based filtering
      •  Each theme has the same explanatory power: balanced vector for content-based filtering
      •  Explicability: each theme corresponds to a few labels
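To make the factorization X = WH concrete, here is a dependency-free sketch using the classical multiplicative update rules for the squared-error objective. It is a swapped-in illustration, not the scikit-learn implementation the slide refers to; the function names and the tiny 4 x 4 matrix are assumptions.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(x, n_themes, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative x (image x tag) into w (image x theme) and h (theme x tag)."""
    rng = random.Random(seed)
    n_images, n_tags = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(n_themes)] for _ in range(n_images)]
    h = [[rng.random() + 0.1 for _ in range(n_tags)] for _ in range(n_themes)]
    for _ in range(n_iter):
        # h <- h * (w^T x) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_tags)]
             for i in range(n_themes)]
        # w <- w * (x h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_themes)]
             for i in range(n_images)]
    return w, h


# Toy image x tag matrix with two obvious themes (tags 0-1 and tags 2-3).
x = [
    [0.8, 0.4, 0.0, 0.0],
    [0.4, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.3],
    [0.0, 0.0, 0.9, 0.45],
]
w, h = nmf(x, n_themes=2)
approx = matmul(w, h)
error = sum((x[i][j] - approx[i][j]) ** 2 for i in range(4) for j in range(4))
```

Each row of `w` is then the theme vector of one image, and each row of `h` shows which tags make up a theme.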
  37. Image content detection
      Topic scores determine the importance of topics in an image
      Image 1 (topic: topic score in %):
      •  Golf course – Fairway – Putting green: 31
      •  Hotel – Inn – Apartment building outdoor: 30
      •  Swimming pool – Lido deck – Hot tub outdoor: 22
      •  Beach – Coast – Harbor: 17
      Image 2 (topic: topic score in %):
      •  Tower – Skyscraper – Office building: 62
      •  Bridge – River – Viaduct: 38
  38. Note on model performance
      •  Image labels are used for similarity
      Calling a grass field “putting green”:
      •  Is not important if all grass fields are called this way
      •  Would be if we had lots of golf trip sales
      •  Improving the NN performance?
      •  Labels are used in NMF and reduced to themes
      •  Themes are used to calculate similarities for CB recommenders
      •  CB recommenders are used as a feature in the meta model
      •  The meta model gives probabilities of purchase = order
      •  Users only check 10 sales…
      -> What is the change in online performance for 1% accuracy?
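The step "themes are used to calculate similarities" can be illustrated with cosine similarity between theme-score vectors. The slides do not name the similarity measure, so cosine is an assumption here. Vectors `image_a` and `image_b` reuse the topic scores from the image content detection slide; `image_c` is a made-up third image.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Theme order: golf course, hotel, swimming pool, beach, tower, bridge.
image_a = [31, 30, 22, 17, 0, 0]
image_b = [0, 0, 0, 0, 62, 38]
image_c = [0, 40, 35, 25, 0, 0]  # hypothetical hotel / pool / beach image

sim_ab = cosine_similarity(image_a, image_b)
sim_ac = cosine_similarity(image_a, image_c)
```

Because the similarity only depends on which themes co-occur, a consistently mislabelled theme (the "putting green" case above) still ranks similar images close together.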
  39. Results
  40. Results?
      All visits:
      •  Mostly France
      •  Pool displayed
      First recommendation:
      •  Fails to display pools
      Only images?
      •  Pools all around the world
      Third column = right mix
  41. Results?
      All visits:
      •  Spain
      •  Sun & Beach
      •  Pool displayed
      First recommendation:
      •  Displays nature…
      Only images?
      •  Pools all around the world
      Third = right mix
      •  Gets the bungalow feature!
  42. Conclusion
      •  Do iterative data science!
      •  Start simple and grow
      •  Validate each step
      •  Image labelling = BI on steroids
      •  Deep Learning?
      •  Is there existing data?
      •  Is there a pre-trained model?
      •  Transfer Learning
      •  Cheaper, faster
      •  Any data scientist can do it
      •  What’s next?
  43. Learned along the way
      For ski sales, showing indoor pictures performs better
      What’s next?
      •  Comparison of proposed vs. visited vacations
      $\mathrm{Attractiveness}(\mathit{tag}) = \dfrac{\text{Visited offers containing tag}}{\text{Proposed offers containing tag}}$
      VPG offers database: Ocean 67%, Bedroom 33%
      Visits database: Ocean 33%, Bedroom 67%
      •  Voyage Privé offers database = baseline
      •  « Bedroom » attractiveness = 0.67/0.33 = 2
      •  « Ocean » attractiveness = 0.33/0.67 = 0.5
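The attractiveness ratio on this slide is simple enough to compute directly. The sketch below uses the shares quoted on the slide; the function name and the dictionary layout are assumptions.

```python
def attractiveness(visited_share, proposed_share):
    """Ratio of a tag's share among visited offers to its share among proposed offers."""
    return {
        tag: visited_share.get(tag, 0.0) / share
        for tag, share in proposed_share.items()
        if share > 0
    }


# Shares from the slide: the VPG offers database is the baseline.
proposed = {"Ocean": 0.67, "Bedroom": 0.33}
visited = {"Ocean": 0.33, "Bedroom": 0.67}
scores = attractiveness(visited, proposed)
```

A value above 1 means the tag is over-represented among visited offers relative to what was proposed, and a value below 1 means it is under-represented.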
  44. Learned along the way
      For ski sales, showing indoor pictures performs better
      What’s next?
  45. What’s Next? Kenya, Prague, Berlin, Cambodia
  46. What’s Next? Customize the image! Kenya, Prague, Berlin, Cambodia
  47. Thank you for your attention!