Using Browsing Behavior Logs to Predict Users' Gender
Rick, Kent, Koi
Overview
● Huge data burns money
o 28 million PV / day
o 7.7 million UV / day
o 4.4 billion articles in total
o 4.7 million registered users in total
● Only 2% of visitors log in. Who are the other 98%?
Problem Definition
• Use the history data of only the 2% of users who log in to predict the gender of the other 98%
[Diagram: Training Data -> Train Model -> User Model; Unknown Cookie -> Model (To Predict) -> Gender Result]
Training Flow
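[Diagram: Raw Log -> (Selection) -> Target Data -> (Preprocessing) -> Training Data -> (Transformation) -> Transformed Data -> (Data Mining) -> Pattern]
• Take the last three months of browsing records from logged-in users, keeping only users who viewed at least two different articles
• Use the Naive Bayes algorithm to build the prediction model
• Feature Extraction
• Feature Selection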
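The user-selection rule for the training set (logged-in users who viewed at least two different articles) could be sketched roughly as below. The log layout, the `clicks` sample, and the `eligible_users` helper are hypothetical names made up for illustration; the slides do not show the actual query.

```python
from collections import defaultdict

# Hypothetical click log: (user_id, article_id) pairs for logged-in users.
clicks = [
    ("u1", "a1"), ("u1", "a2"),
    ("u2", "a1"), ("u2", "a1"),                  # the same article twice
    ("u3", "a3"), ("u3", "a4"), ("u3", "a5"),
]


def eligible_users(click_log, min_distinct_articles=2):
    """Return users who viewed at least `min_distinct_articles` different articles."""
    seen = defaultdict(set)
    for user_id, article_id in click_log:
        seen[user_id].add(article_id)
    return {
        user_id
        for user_id, articles in seen.items()
        if len(articles) >= min_distinct_articles
    }


training_users = eligible_users(clicks)
```

Counting distinct articles rather than raw clicks means a user who reloads one page repeatedly is still excluded.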
Prediction Flow
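[Diagram: Raw Log -> (Selection) -> Predict Data -> (Preprocessing) -> (Transformation) -> Transformed Data -> (Naive Bayes) -> Pattern]
• Take the last three months of browsing records from users who are not logged in, which make up about 98% of the site's data
• Use the Naive Bayes algorithm to predict gender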
Naive Bayes Formula
Roughly speaking, it comes down to seeing which one appears more often!
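For reference, the textbook Naive Bayes decision rule is written out below. This is the standard form, not a transcription of the slide's own formula. Here $y$ is the gender class and $x_1, \dots, x_n$ are the features of one cookie. The second line is the Bernoulli variant for binary features, where $p_{iy}$ denotes the probability that feature $i$ is present in class $y$.

```latex
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)

P(x_i \mid y) = p_{iy}^{\,x_i} \, (1 - p_{iy})^{\,1 - x_i}, \qquad x_i \in \{0, 1\}
```

Since $P(y)$ and each $p_{iy}$ are estimated from counts in the training data, the class whose features "appear more often" wins.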
Naive Bayes in Python Scikit-learn
http://scikit-learn.org/stable/modules/naive_bayes.html#bernoulli-naive-bayes
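A minimal sketch of training and using scikit-learn's `BernoulliNB` on binary features follows. The toy matrix, the three column meanings, and the label assignment are assumptions for illustration only; the deck encodes gender as 1 or 2 but does not say which number means which.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy binary feature matrix; the columns stand for three hypothetical
# category features (for example finance, beauty, fashion).
X_train = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
])
# Labels use the deck's "1 or 2" encoding; the mapping to male/female
# is not stated in the slides.
y_train = np.array([1, 1, 1, 2, 2, 2])

model = BernoulliNB(alpha=1.0)  # alpha=1.0 is Laplace smoothing, the library default
model.fit(X_train, y_train)

X_new = np.array([[1, 0, 0], [0, 1, 1]])
predicted = model.predict(X_new)            # most likely class per row
probabilities = model.predict_proba(X_new)  # one probability per class, per row
```

`predict_proba` is useful when only confident predictions should be acted on, for example by applying a threshold before assigning a gender to a cookie.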
Raw Data (Matrix?)
Training Data Set Overview
Item                 Description                          Comment
Date                 20150223 ~ 20150424
Total Click Counts   10,908,692
Login User           Male: 149,403; Female: 229,448
Feature              Before: 2,543,240; After: 508,648    chi-square used for feature selection
Feature Extraction
• Categorical feature -> binary features
• Example

Before:
Feature Name        Feature Value
Article Type        A, B, C, D, E

After:
Feature Name        Feature Value
Article Type - A    0, 1
Article Type - B    0, 1
Article Type - C    0, 1
Article Type - D    0, 1
Article Type - E    0, 1
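The categorical-to-binary conversion in the example above can be sketched as follows. The helper name `to_binary_features` is made up for illustration; the feature names follow the table.

```python
# The five article types from the example table.
ARTICLE_TYPES = ["A", "B", "C", "D", "E"]


def to_binary_features(article_type):
    """Expand one categorical value into one 0/1 feature per possible value."""
    return {
        f"Article Type - {candidate}": int(candidate == article_type)
        for candidate in ARTICLE_TYPES
    }


encoded = to_binary_features("C")
```

Exactly one of the five binary features is 1 for any given article, which is the form a Bernoulli Naive Bayes model expects.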
Features List
Feature Name      Description                              Example
gender            gender of the logged-in user             1 or 2
cat               the article's category                   旅遊 (Travel)
url               the blog URL                             http://kittyfish.pixnet.net/blog/post/345566174
ariticle_author   the blog's author                        kittyfish
article_id        the blog's unique ID                     345566174
hours             hour of the click event                  6
refers            the referrer URL                         http://www.google.com/
country           country predicted from the IP address    tw
But... Too Many Features (burning money again)
• T = 2,450,000 x 2,543,240
• Many features are irrelevant for prediction
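To put the size of T in perspective, the sketch below computes the number of cells a dense matrix would need and builds a tiny sparse matrix of the kind scikit-learn estimators accept. Sparse storage is an assumption here; the slides do not say how the matrix was stored.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Dimensions of T as given on the slide.
n_rows, n_features = 2_450_000, 2_543_240
dense_cells = n_rows * n_features  # about 6.2e12 cells, roughly 6.2 TB at one byte each

# A toy 3 x 5 binary matrix stored sparsely: only the non-zero cells are kept.
row_index = np.array([0, 0, 1, 2])
col_index = np.array([0, 3, 1, 3])
values = np.ones(4, dtype=np.int8)
X = csr_matrix((values, (row_index, col_index)), shape=(3, 5))

density = X.nnz / (X.shape[0] * X.shape[1])  # fraction of cells that are non-zero
```

With binary features most cells are 0, so the sparse form stores only the clicks that actually happened.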
Feature Selection – Chi Square
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html
http://www.slideshare.net/parth241989/chi-square-test-16093013
Chi-Square Value    Dependence on Result
Large               High
Small               Low
• 2,543,240 features -> 508,648 features
• Precision: 74% -> 81%
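A minimal sketch of chi-square feature selection with scikit-learn's `SelectKBest` and `chi2` is shown below. The toy data and `k=2` are assumptions for illustration; the slides give only the before and after feature counts, not the selection parameters.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy binary data: columns 0 and 1 track the label closely,
# columns 2 and 3 are nearly unrelated to it.
X = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
])
y = np.array([1, 1, 1, 2, 2, 2])

selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 highest-scoring features
X_reduced = selector.fit_transform(X, y)
kept_columns = selector.get_support(indices=True)
chi2_scores = selector.scores_
```

A large chi-square score means the feature's occurrence depends strongly on the class, so the low-scoring columns are the ones dropped.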
Which Features Are Important?
feature_name    male_prob   female_prob   male_count   female_count   total   prob_distance
cat_財經企管    0.137798    0.045564      20587        10454          31041   0.184468
cat_美容彩妝    0.062211    0.137009      9294         31436          40730   0.149596
cat_時尚流行    0.079325    0.151936      11851        34861          46712   0.145221
cat_親子育兒    0.079640    0.133178      11898        30557          42455   0.107076
cat_心情日記    0.180942    0.231797      27033        53185          80218   0.101709
cat_國外旅遊    0.152288    0.194490      22752        44625          67377   0.084403
author_XXXXX    0.049975    0.009037      7466         2073           9539    0.081877
cat_食譜分享    0.054607    0.093596      8158         21475          29633   0.077978
cat_圖文創作    0.085483    0.122831      12771        28183          40954   0.074696
Category names: 財經企管 = Finance and Business Management; 美容彩妝 = Beauty and Makeup; 時尚流行 = Fashion and Trends; 親子育兒 = Parenting; 心情日記 = Mood Diary; 國外旅遊 = Overseas Travel; 食譜分享 = Recipe Sharing; 圖文創作 = Illustrated Works
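The slides do not define the prob_distance column. In the table above it appears to equal twice the absolute difference between male_prob and female_prob, which is the L1 distance between the two Bernoulli distributions. The sketch below checks that reading against the published rows; the formula is an inference from the numbers, not something the deck states.

```python
# (feature_name, male_prob, female_prob, prob_distance), copied from the table.
rows = [
    ("cat_財經企管", 0.137798, 0.045564, 0.184468),
    ("cat_美容彩妝", 0.062211, 0.137009, 0.149596),
    ("cat_時尚流行", 0.079325, 0.151936, 0.145221),
    ("cat_親子育兒", 0.079640, 0.133178, 0.107076),
    ("cat_心情日記", 0.180942, 0.231797, 0.101709),
    ("cat_國外旅遊", 0.152288, 0.194490, 0.084403),
    ("author_XXXXX", 0.049975, 0.009037, 0.081877),
    ("cat_食譜分享", 0.054607, 0.093596, 0.077978),
    ("cat_圖文創作", 0.085483, 0.122831, 0.074696),
]


def prob_distance(male_prob, female_prob):
    """Assumed definition: |m - f| + |(1 - m) - (1 - f)| = 2 * |m - f|."""
    return 2 * abs(male_prob - female_prob)


# Largest gap between the assumed formula and the published column.
max_error = max(
    abs(prob_distance(male, female) - published)
    for _, male, female, published in rows
)
```

The remaining gap is at the level of rounding in the six-decimal values, so the assumed definition is at least consistent with the table.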
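Which Features Are Important?
• Article category alone gives a first indication of gender tendency
• Certain specific authors and articles are especially useful for identifying male users
• Male click distributions show stronger specific tendencies than female ones; this matches the later online experiment with GA, in which prediction precision for males was higher than for females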
Feature Distribution
A small number of features are highly influential, but the long tail of the remaining features still matters and helps gain the last few percentage points.
Prediction Set Data Analysis
Feature     Intersection/Training   Intersection/Prediction
hour        100.00%                 91.67%
author      94.37%                  7.79%
country     100.00%                 2.46%
category    100.00%                 ???
article     84.53%                  2.64%
referer     94.50%                  8.76%
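The two ratios in the table can be illustrated with the sketch below. It assumes the ratios are taken over distinct feature values (for example distinct authors) rather than over clicks, which the slides do not specify; the sample values and the `intersection_ratios` helper are hypothetical.

```python
def intersection_ratios(training_values, prediction_values):
    """Return (intersection / training, intersection / prediction) over distinct values."""
    training = set(training_values)
    prediction = set(prediction_values)
    common = training & prediction
    return len(common) / len(training), len(common) / len(prediction)


# Hypothetical authors: the prediction set is much larger than the training set.
training_authors = {"a", "b", "c", "d"}
prediction_authors = {"b", "c", "d"} | {f"unseen_{i}" for i in range(37)}

vs_training, vs_prediction = intersection_ratios(training_authors, prediction_authors)
```

The same overlap looks high against the small training set and low against the much larger prediction set, which is the pattern the table shows for author, article, country, and referer.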
Real-World Record
Live Experiment on PIXNET
Falcon (Advertisement) System
Validation by Google Analytics
● Is GA "God" (the ground truth)?
● How do we use it?
[Diagram: A non-registered user -> Classification Model -> Prediction. Users that UGD says are male and users that UGD says are female map to GA Set 1 and GA Set 2; within each set, GA says Male / GA says Female.]
Prediction Set Data Analysis
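• Because the prediction data is far larger than the training data, the intersection ratio is quite high when the training set is used as the denominator
• With the prediction data as the denominator, however, the intersection ratios for Article, Author, Country, and Referer are all below 10%, as shown in the figure below
• For Article and Author, this is because Pixnet users' reading concentrates on particular articles; the other articles get very few clicks, and some have never been viewed by anyone else
[Diagram: two Venn diagrams of Prediction Set and Training Set, one for Article, Author, Referrer and one for Hour & Category]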
Implementation - System Architecture
Implementation - Technology Inventory List
Technology Tool            Purpose
Scikit-learn               Machine learning library
Redis                      Cookie profile database
Python                     Programming language
Celery                     Scheduling framework
Redshift                   Data warehouse for large raw data
Django & REST framework    Build API service for internal systems
Implementation - Performance Tuning
● CPU
o Batch prediction: 1000x speed-up
o Parallel processing: full use of multiple cores, 8x speed-up (Python)
● Memory
o Garbage collection (Python del)
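The batch-prediction and memory points above might look roughly like the sketch below. The synthetic data, sizes, and variable names are assumptions for illustration, and the sketch makes no claim about reproducing the speed-ups quoted on the slide.

```python
import gc

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Synthetic binary training data; the label is driven by the first feature
# so that the two classes are clearly separated.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(200, 20))
y_train = X_train[:, 0] + 1
model = BernoulliNB().fit(X_train, y_train)

X_new = rng.integers(0, 2, size=(200, 20))

# Slow pattern: one predict() call per row.
one_by_one = np.array([model.predict(row.reshape(1, -1))[0] for row in X_new])

# Batch pattern: a single vectorized predict() call over all rows.
batched = model.predict(X_new)

# Drop the reference to the large input once predictions are stored,
# then ask the garbage collector to run.
del X_new
gc.collect()
```

Both patterns give the same predictions; the batch call simply avoids per-call overhead, which dominates when each row is scored separately.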
Reference
● http://www.cs.ucr.edu/~eamonn/CE/Bayesian%20Classification%20withInsect_examples.pdf
● https://www.iperceptions.com/~/media/files/knowledge/whitepapers/iperceptionsintentrecognitionenginewhitepaperfeb2014v13.ashx
● A Two-Stage Ensemble of Diverse Models for Advertisement ...
● http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
● Why use Naive Bayes: http://www.aaai.org/Papers/FLAIRS/2004/Flairs04-097.pdf
● Unbias: http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
