SlideShare uma empresa Scribd logo
1 de 35
Baixar para ler offline
Machine Learning to moderate
ads in real world classified's
business
by Vaibhav Singh & Jaroslaw Szymczak
Agenda
● Moderation problem
● Offline model creation
○ feature generation
○ feature selection
○ data leakage
○ the algorithm
● Model evaluation
● Going live with the product
○ is your data really big?
○ automatic model creation pipeline
○ consistent development and production environments
○ platform architecture
○ performance monitoring
50+
countries
60+ million
new monthly listings
18+ million
unique monthly sellers
What do moderators look for?
Avoidance of payment
Sell another item in paid
listing by changing its
content
Flood site with duplicate
posts to increase
visibility
Create multiple accounts
to bypass free ad per
user limit
Violation of ToS
Add Phone numbers,
Company information on
image rather than in
description or dedicated
fields
Try to sell forbidden
items, very often with
title and description that
try to evade keyword
filters
Miscategorized listings
Item is placed in wrong
category
Item is coming from
legitimate business, but
is marked as coming
from individual
‘Seek’ problem in job
offers
Offline model creation
Feature
engineering...
… and selection
Feature selection:
● necessary for some
algorithms, for others -
not so much
● most important features
● avoiding leakage
Feature generation - one-hot-encoding
Feature generation - feature hashing
Feature hashing
➔ Good when dealing high
dimensional, sparse features --
dimensionality reduction
➔ Memory efficient
➔ Cons - Getting back to feature
names is difficult
➔ Cons - Hash collisions can have
negative effects
Data Leakage
➔ Remove obvious fields
e.g.: id, account numbers
➔ Check the importance of
the features for any
unusual observations
➔ Have hold-out set that you
do not process wrt. target
variable
➔ Closely monitor live
performance
The algorithm
Desired features:
● state-of-the-art structured
binary problems
● allowing reducing variance
errors (overfitting)
● allowing reducing bias errors
(underfitting)
● has efficient implementation
eXtreme Gradient Boosting (XGBoost)
Source: https://www.slideshare.net/JaroslawSzymczak1/xgboost-the-algorithm-that-wins-every-competition
Model evaluation
Beyond accuracy
● ROC AUC (Receiver-Operator Curve):
○ can be interpreted as concordance probability (i.e. random positive example has the
probability equal to AUC, that it’s score is higher)
○ it is too abstract to use as a standalone quality metric
○ does not depend on classes ratio
● PRC AUC (Precision-Recall Curve)
○ Depends on data balance
○ Is not intuitively interpretable
● Precision @ fixed Recall, Recall @ fixed Precision:
○ can be found using thresholding
○ they heavily depend on data balance
○ they are the best to reflect the business requirements
○ and to take into account processing capabilities
(then actually Precision @k is more accurate)
● choose one, and only one as your KPI and others as
constraints
Example ROC for moderation problem
Precision-recall curve example
Precision @recall
Recall @precision
Going live with the product
Is your data
really big?
SVM Light
Data Format
➔ Memory Efficient.
Features can be created
on one machine and do
not require huge clusters
➔ Cons - Number of
features is unknown,
store it separately
1 191:-0.44 87214:-0.44 200004:0.20 200012:1 206976:1 206983:-1 207015:1 207017:1 226201:1
1 1738:0.57 130440:-0.57 206999:0.32 207000:28 207001:6 207013:1 207015:1 207017:1 226300:1
0 2812:-0.63 34755:-0.31 206995:2.28 206997:1 206998:2 206999:0.00 207000:1 207001:28 226192:1
1 4019:0.35 206999:0.43 207000:40 207001:18 207013:1 207014:1 207016:1 226261:1
0 8903:0.37 207000:4 207001:14 207013:1 207014:1 207016:1 226262:1
1 5878:-0.27 206995:2.28 206998:1 206999:5.80 207000:1 207001:24 226187:1
Lessons Learnt
➔ Do not go for distributed learning if you
don’t need to
➔ Choose your tech dependent on data size.
Do not go for hype driven development
➔ Your machine does not limit, there’s cloud
➔ Ask yourself: What’s the most difficult
problem to scale ? → People
Model Generation Pipeline
Automatic
model creation
pipeline
● Automation makes things
deterministic
● Airflow, Luigi and many others
are good choice for Job
dependency management
Luigi Dashboard
Luigi Task Visualizer
Lessons Learnt
➔ when you use the output path on your own,
create your output at the very end of the
task
➔ you can dynamically create dependencies
by yielding the task
➔ adding workers parameter to your
command parallelizes task that are ready
to be run (e.g. python run.py Task …
--workers 15)
Consistent development
and production
environments
Model Serving Architecture
Flask API
Queue Prediction
Module
Mongo
Monitoring & Stats
Graphite, Grafana
Learning
Module
Scikit
XGBoost
Luigi
Ask Prediction
Return Prediction
Learning Ads
Image Model Serving Architecture
AWS Kinensis
Stream
Incoming
Pictures
Hash Generation
Country Specific Image
Moderation
General Moderation NSFW
Tag and Category
Prediction
Mongo
OLX Site
S3
Models
GPU Clusters
Learning Cluster
TF, Keras, MxNet
Performance monitoring
Model monitoring and management
Lessons Learnt
➔ Always Batch
Batching will reduce CPU Utilization and the same machines
would be able to handle much more requests
➔ Modularize, Dockerize and Orchestrate
Containerize your code so that it is transparent to Machine
configurations
➔ Monitoring
Use a monitoring service
➔ Choose simple and easy tech
Acknowledgements
● Andrzej Prałat
● Wojciech Rybicki
Vaibhav Singh
vaibhav.singh@olx.com
Jaroslaw Szymczak
jaroslaw.szymczak@olx.com
PYDATA BERLIN 2017
July 2nd
, 2017

Mais conteúdo relacionado

Semelhante a Machine Learning to moderate ads in real world classified's business

Predictive Analytics Project in Automotive Industry
Predictive Analytics Project in Automotive IndustryPredictive Analytics Project in Automotive Industry
Predictive Analytics Project in Automotive IndustryMatouš Havlena
 
MOPs & ML Pipelines on GCP - Session 6, RGDC
MOPs & ML Pipelines on GCP - Session 6, RGDCMOPs & ML Pipelines on GCP - Session 6, RGDC
MOPs & ML Pipelines on GCP - Session 6, RGDCgdgsurrey
 
Live predictions with schemaless data at scale. MLMU Kosice, Exponea
Live predictions with schemaless data at scale. MLMU Kosice, ExponeaLive predictions with schemaless data at scale. MLMU Kosice, Exponea
Live predictions with schemaless data at scale. MLMU Kosice, ExponeaData Science Club
 
Moving from BI to AI : For decision makers
Moving from BI to AI : For decision makersMoving from BI to AI : For decision makers
Moving from BI to AI : For decision makerszekeLabs Technologies
 
Group 3 slide presentation
Group 3 slide presentationGroup 3 slide presentation
Group 3 slide presentationMichael Young
 
Moving from BI to AI : For decision makers
Moving from BI to AI : For decision makersMoving from BI to AI : For decision makers
Moving from BI to AI : For decision makerszekeLabs Technologies
 
Strangle The Monolith: A Data Driven Approach
Strangle The Monolith: A Data Driven ApproachStrangle The Monolith: A Data Driven Approach
Strangle The Monolith: A Data Driven ApproachVMware Tanzu
 
Recommendations for Building Machine Learning Software
Recommendations for Building Machine Learning SoftwareRecommendations for Building Machine Learning Software
Recommendations for Building Machine Learning SoftwareJustin Basilico
 
The Machine Learning Audit
The Machine Learning AuditThe Machine Learning Audit
The Machine Learning AuditAndrew Clark
 
Enterprise PHP Architecture through Design Patterns and Modularization (Midwe...
Enterprise PHP Architecture through Design Patterns and Modularization (Midwe...Enterprise PHP Architecture through Design Patterns and Modularization (Midwe...
Enterprise PHP Architecture through Design Patterns and Modularization (Midwe...Aaron Saray
 
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligence
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligenceSpark summit 2017- Transforming B2B sales with Spark powered sales intelligence
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligenceWei Di
 
Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...
Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...
Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...Databricks
 
Automatic image moderation in classifieds, Jarosław Szymczak
Automatic image moderation in classifieds, Jarosław SzymczakAutomatic image moderation in classifieds, Jarosław Szymczak
Automatic image moderation in classifieds, Jarosław SzymczakPôle Systematic Paris-Region
 
Big data and other buzzwords
Big data and other buzzwordsBig data and other buzzwords
Big data and other buzzwordsAndrew Clark
 
Transforming B2B Sales with Spark Powered Sales Intelligence
Transforming B2B Sales with Spark Powered Sales IntelligenceTransforming B2B Sales with Spark Powered Sales Intelligence
Transforming B2B Sales with Spark Powered Sales IntelligenceSongtao Guo
 
AppDynamics User Group
AppDynamics User GroupAppDynamics User Group
AppDynamics User GroupMike Ruangutai
 
Using the Business Process Technology Workflow Engine for Advanced Modeling
Using the Business Process Technology Workflow Engine for Advanced ModelingUsing the Business Process Technology Workflow Engine for Advanced Modeling
Using the Business Process Technology Workflow Engine for Advanced ModelingOutSystems
 
Scaling & Transforming Stitch Fix's Visibility into What Folks will love
Scaling & Transforming Stitch Fix's Visibility into What Folks will loveScaling & Transforming Stitch Fix's Visibility into What Folks will love
Scaling & Transforming Stitch Fix's Visibility into What Folks will loveJune Andrews
 

Semelhante a Machine Learning to moderate ads in real world classified's business (20)

Predictive Analytics Project in Automotive Industry
Predictive Analytics Project in Automotive IndustryPredictive Analytics Project in Automotive Industry
Predictive Analytics Project in Automotive Industry
 
MOPs & ML Pipelines on GCP - Session 6, RGDC
MOPs & ML Pipelines on GCP - Session 6, RGDCMOPs & ML Pipelines on GCP - Session 6, RGDC
MOPs & ML Pipelines on GCP - Session 6, RGDC
 
Live predictions with schemaless data at scale. MLMU Kosice, Exponea
Live predictions with schemaless data at scale. MLMU Kosice, ExponeaLive predictions with schemaless data at scale. MLMU Kosice, Exponea
Live predictions with schemaless data at scale. MLMU Kosice, Exponea
 
Moving from BI to AI : For decision makers
Moving from BI to AI : For decision makersMoving from BI to AI : For decision makers
Moving from BI to AI : For decision makers
 
Group 3 slide presentation
Group 3 slide presentationGroup 3 slide presentation
Group 3 slide presentation
 
Moving from BI to AI : For decision makers
Moving from BI to AI : For decision makersMoving from BI to AI : For decision makers
Moving from BI to AI : For decision makers
 
Strangle The Monolith: A Data Driven Approach
Strangle The Monolith: A Data Driven ApproachStrangle The Monolith: A Data Driven Approach
Strangle The Monolith: A Data Driven Approach
 
Recommendations for Building Machine Learning Software
Recommendations for Building Machine Learning SoftwareRecommendations for Building Machine Learning Software
Recommendations for Building Machine Learning Software
 
MLOps.pptx
MLOps.pptxMLOps.pptx
MLOps.pptx
 
The Machine Learning Audit
The Machine Learning AuditThe Machine Learning Audit
The Machine Learning Audit
 
Enterprise PHP Architecture through Design Patterns and Modularization (Midwe...
Enterprise PHP Architecture through Design Patterns and Modularization (Midwe...Enterprise PHP Architecture through Design Patterns and Modularization (Midwe...
Enterprise PHP Architecture through Design Patterns and Modularization (Midwe...
 
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligence
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligenceSpark summit 2017- Transforming B2B sales with Spark powered sales intelligence
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligence
 
Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...
Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...
Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...
 
Automatic image moderation in classifieds, Jarosław Szymczak
Automatic image moderation in classifieds, Jarosław SzymczakAutomatic image moderation in classifieds, Jarosław Szymczak
Automatic image moderation in classifieds, Jarosław Szymczak
 
Big data and other buzzwords
Big data and other buzzwordsBig data and other buzzwords
Big data and other buzzwords
 
Transforming B2B Sales with Spark Powered Sales Intelligence
Transforming B2B Sales with Spark Powered Sales IntelligenceTransforming B2B Sales with Spark Powered Sales Intelligence
Transforming B2B Sales with Spark Powered Sales Intelligence
 
Introduction-To-RPA_1.pptx
Introduction-To-RPA_1.pptxIntroduction-To-RPA_1.pptx
Introduction-To-RPA_1.pptx
 
AppDynamics User Group
AppDynamics User GroupAppDynamics User Group
AppDynamics User Group
 
Using the Business Process Technology Workflow Engine for Advanced Modeling
Using the Business Process Technology Workflow Engine for Advanced ModelingUsing the Business Process Technology Workflow Engine for Advanced Modeling
Using the Business Process Technology Workflow Engine for Advanced Modeling
 
Scaling & Transforming Stitch Fix's Visibility into What Folks will love
Scaling & Transforming Stitch Fix's Visibility into What Folks will loveScaling & Transforming Stitch Fix's Visibility into What Folks will love
Scaling & Transforming Stitch Fix's Visibility into What Folks will love
 

Último

Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsJoseMangaJr1
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Pooja Nehwal
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 

Último (20)

Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 

Machine Learning to moderate ads in real world classified's business

  • 1. Machine Learning to moderate ads in real world classified's business by Vaibhav Singh & Jaroslaw Szymczak
  • 2. Agenda ● Moderation problem ● Offline model creation ○ feature generation ○ feature selection ○ data leakage ○ the algorithm ● Model evaluation ● Going live with the product ○ is your data really big? ○ automatic model creation pipeline ○ consistent development and production environments ○ platform architecture ○ performance monitoring
  • 3. 50+ countries 60+ million new monthly listings 18+ million unique monthly sellers
  • 4. What do moderators look for? Avoidance of payment Sell another item in paid listing by changing its content Flood site with duplicate posts to increase visibility Create multiple accounts to bypass free ad per user limit Violation of ToS Add Phone numbers, Company information on image rather than in description or dedicated fields Try to sell forbidden items, very often with title and description that try to evade keyword filters Miscategorized listings Item is placed in wrong category Item is coming from legitimate business, but is marked as coming from individual ‘Seek’ problem in job offers
  • 6. Feature engineering... … and selection Feature selection: ● necessary for some algorithms, for others - not so much ● most important features ● avoiding leakage
  • 7. Feature generation - one-hot-encoding
  • 8. Feature generation - feature hashing
  • 9. Feature hashing ➔ Good when dealing high dimensional, sparse features -- dimensionality reduction ➔ Memory efficient ➔ Cons - Getting back to feature names is difficult ➔ Cons - Hash collisions can have negative effects
  • 10. Data Leakage ➔ Remove obvious fields e.g.: id, account numbers ➔ Check the importance of the features for any unusual observations ➔ Have hold-out set that you do not process wrt. target variable ➔ Closely monitor live performance
  • 11. The algorithm Desired features: ● state-of-the-art structured binary problems ● allowing reducing variance errors (overfitting) ● allowing reducing bias errors (underfitting) ● has efficient implementation
  • 12. eXtreme Gradient Boosting (XGBoost) Source: https://www.slideshare.net/JaroslawSzymczak1/xgboost-the-algorithm-that-wins-every-competition
  • 14.
  • 15. Beyond accuracy ● ROC AUC (Receiver-Operator Curve): ○ can be interpreted as concordance probability (i.e. random positive example has the probability equal to AUC, that it’s score is higher) ○ it is too abstract to use as a standalone quality metric ○ does not depend on classes ratio ● PRC AUC (Precision-Recall Curve) ○ Depends on data balance ○ Is not intuitively interpretable ● Precision @ fixed Recall, Recall @ fixed Precision: ○ can be found using thresholding ○ they heavily depend on data balance ○ they are the best to reflect the business requirements ○ and to take into account processing capabilities (then actually Precision @k is more accurate) ● choose one, and only one as your KPI and others as constraints
  • 16. Example ROC for moderation problem
  • 20. Going live with the product
  • 22. SVM Light Data Format ➔ Memory Efficient. Features can be created on one machine and do not require huge clusters ➔ Cons - Number of features is unknown, store it separately 1 191:-0.44 87214:-0.44 200004:0.20 200012:1 206976:1 206983:-1 207015:1 207017:1 226201:1 1 1738:0.57 130440:-0.57 206999:0.32 207000:28 207001:6 207013:1 207015:1 207017:1 226300:1 0 2812:-0.63 34755:-0.31 206995:2.28 206997:1 206998:2 206999:0.00 207000:1 207001:28 226192:1 1 4019:0.35 206999:0.43 207000:40 207001:18 207013:1 207014:1 207016:1 226261:1 0 8903:0.37 207000:4 207001:14 207013:1 207014:1 207016:1 226262:1 1 5878:-0.27 206995:2.28 206998:1 206999:5.80 207000:1 207001:24 226187:1
  • 23. Lessons Learnt ➔ Do not go for distributed learning if you don’t need to ➔ Choose your tech dependent on data size. Do not go for hype driven development ➔ Your machine does not limit, there’s cloud ➔ Ask yourself: What’s the most difficult problem to scale ? → People
  • 25. Automatic model creation pipeline ● Automation makes things deterministic ● Airflow, Luigi and many others are good choice for Job dependency management
  • 28. Lessons Learnt ➔ when you use the output path on your own, create your output at the very end of the task ➔ you can dynamically create dependencies by yielding the task ➔ adding workers parameter to your command parallelizes task that are ready to be run (e.g. python run.py Task … --workers 15)
  • 30. Model Serving Architecture Flask API Queue Prediction Module Mongo Monitoring & Stats Graphite, Grafana Learning Module Scikit XGBoost Luigi Ask Prediction Return Prediction Learning Ads
  • 31. Image Model Serving Architecture AWS Kinensis Stream Incoming Pictures Hash Generation Country Specific Image Moderation General Moderation NSFW Tag and Category Prediction Mongo OLX Site S3 Models GPU Clusters Learning Cluster TF, Keras, MxNet
  • 33. Model monitoring and management
  • 34. Lessons Learnt ➔ Always Batch Batching will reduce CPU Utilization and the same machines would be able to handle much more requests ➔ Modularize, Dockerize and Orchestrate Containerize your code so that it is transparent to Machine configurations ➔ Monitoring Use a monitoring service ➔ Choose simple and easy tech
  • 35. Acknowledgements ● Andrzej Prałat ● Wojciech Rybicki Vaibhav Singh vaibhav.singh@olx.com Jaroslaw Szymczak jaroslaw.szymczak@olx.com PYDATA BERLIN 2017 July 2nd , 2017