2. Overview
● Huge Data Burns Money (it really burns money)
o 28 Million PV / Day
o 7.7 Million UV / Day
o 4.4 Billion Articles in Total
o 4.7 Million Registered Users in Total
● Only 2% of visitors log in. Who are the other 98%?
3. Problem Definition
• Use the history data of the 2% of users who log in to predict the gender of the other 98%
[Diagram: the training data is used to train a user model; the model then predicts the gender result for an unknown cookie.]
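The train-then-predict flow can be sketched in a few lines. The slides do not say which classifier is used, so the simplified naive Bayes below is only a stand-in, and the function names (`train`, `predict`) and the `name=value` feature strings are hypothetical. It learns from logged-in users whose gender is known and then labels an unknown cookie from its click features.

```python
import math
from collections import Counter, defaultdict


def train(records):
    """Count classes and features from (feature_set, gender) pairs."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    for features, gender in records:
        class_counts[gender] += 1
        for name in features:
            feature_counts[gender][name] += 1
    return class_counts, feature_counts


def predict(model, features):
    """Return the gender with the highest smoothed log-probability.

    Simplification: only features that are present contribute to the score.
    """
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    best_gender, best_score = None, -math.inf
    for gender, count in class_counts.items():
        score = math.log(count / total)
        for name in features:
            # Laplace-smoothed estimate of P(feature present | gender).
            score += math.log((feature_counts[gender][name] + 1) / (count + 2))
        if score > best_score:
            best_gender, best_score = gender, score
    return best_gender


# Illustrative records only: (features seen for a logged-in user, gender code).
logged_in = [
    ({"cat=travel", "hours=6"}, 2),
    ({"cat=travel"}, 2),
    ({"cat=sports"}, 1),
    ({"cat=sports", "hours=6"}, 1),
]
model = train(logged_in)
unknown_cookie = {"cat=travel"}
predicted_gender = predict(model, unknown_cookie)
```

The same two-step shape applies whatever model is plugged in: fit once on the small labelled population, then score the much larger unlabelled one.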
9. Training Data Set Overview
Item                 Description           Comment
Date                 20150223 ~ 20150424
Total Click Counts   10908692
Login User           Male: 149403
                     Female: 229448
Feature              Before: 2543240       Chi-square is used for feature selection
                     After: 508648
10. Feature Extraction
• Categorical Feature -> Binary Features
• Example
Feature Name       Feature Value
Article Type       A, B, C, D, E

Feature Name       Feature Value
Article Type - A   0, 1
Article Type - B   0, 1
Article Type - C   0, 1
Article Type - D   0, 1
Article Type - E   0, 1
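The conversion in the example above can be sketched as a small one-hot encoder. The helper name `one_hot` is hypothetical; the category values and the "Article Type - X" feature names are taken from the slide.

```python
ARTICLE_TYPES = ["A", "B", "C", "D", "E"]


def one_hot(value, categories=ARTICLE_TYPES, prefix="Article Type"):
    """Turn one categorical value into one 0/1 feature per category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {f"{prefix} - {c}": int(c == value) for c in categories}


encoded = one_hot("C")
```

Exactly one of the five binary features is 1 for each article, which is why the feature count grows so quickly once high-cardinality fields such as article id are encoded the same way.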
11. Features List
Feature Name     Description                               Example
gender           the gender of the logged-in user          1 or 2
cat              the article's category                    旅遊 (travel)
url              the blog post URL                         http://kittyfish.pixnet.net/blog/post/345566174
article_author   the blog's author                         kittyfish
article_id       the blog post's unique id                 345566174
hours            the hour of the click event               6
refers           the referring URL                         http://www.google.com/
country          the country predicted from the IP address tw
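Judging from the examples in the list, the author and article id look like they can be read straight out of the blog URL. The sketch below assumes that the author is the first label of the host name and that the path has the form /blog/post/<id>; both are inferred from the single example URL, and `parse_blog_url` is a hypothetical helper.

```python
from urllib.parse import urlparse


def parse_blog_url(url):
    """Split a blog post URL into author and article id (assumed layout)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    author = host.split(".")[0] if host else None
    parts = [p for p in parsed.path.split("/") if p]
    # Assumed path layout: blog/post/<article id>
    if len(parts) >= 3 and parts[:2] == ["blog", "post"]:
        article_id = parts[2]
    else:
        article_id = None
    return {"article_author": author, "article_id": article_id}


fields = parse_blog_url("http://kittyfish.pixnet.net/blog/post/345566174")
```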
12. But ... Too Many Features (Burning Money Again)
• T = 2,450,000 x 2,543,240
• Many features are irrelevant for prediction
2,543,240 features
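To put the size of T in perspective, here is a back-of-the-envelope calculation. It assumes T is the training matrix with 2,450,000 rows and 2,543,240 feature columns (the slide does not spell this out) and that a dense layout would store one 8-byte float per cell; the 8-byte figure is an assumption for illustration only.

```python
ROWS = 2_450_000
FEATURES = 2_543_240
BYTES_PER_CELL = 8  # assumption: one 64-bit float per cell

cells = ROWS * FEATURES
dense_terabytes = cells * BYTES_PER_CELL / 1e12

print(f"{cells:,} cells, about {dense_terabytes:.1f} TB if stored densely")
```

That is roughly 6.2 trillion cells, or about 49.8 TB under these assumptions. Since the features are binary and each click sets only a handful of them, almost every cell is 0, which is one more reason to cut the feature count.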
13. Feature Selection – Chi Square
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html
http://www.slideshare.net/parth241989/chi-square-test-16093013
Chi-Square Value   Dependence on Result
Large              High
Small              Low
• 2543240 features -> 508648 features
• Precision 74% -> 81%
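A minimal pure-Python sketch of chi-square scoring for binary features follows. For each feature it compares the observed count per class with the count expected if the feature were independent of the class. My understanding is that the scikit-learn function linked on this slide scores non-negative features in the same way, but treat that as an assumption; in practice one would call the library rather than this loop. The helper names and the tiny data set are illustrative only.

```python
def chi_square_scores(X, y):
    """Score each column of X (rows of 0/1 values) against the labels y."""
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))
    class_prob = {c: y.count(c) / n_samples for c in classes}
    scores = []
    for j in range(n_features):
        total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = class_prob[c] * total
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_top(scores, k):
    """Return the indices of the k highest-scoring features, in column order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Six users, three binary features; labels use the slide's 1 / 2 gender codes.
X = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
    [0, 0, 0],
]
y = [1, 1, 1, 2, 2, 2]

scores = chi_square_scores(X, y)
kept = select_top(scores, k=2)
```

Feature 0 appears only for class 1 and gets the largest score, while feature 1 is spread evenly across both classes and scores 0. The slide's counts are consistent with keeping the top fifth of features, since 508648 is exactly 2543240 / 5.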
19. Validation by Google Analytics
● Is GA God (always right)?
● How do we use it?
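[Diagram: a non-registered user is passed through the classification model for prediction. Users that UGD says are male form GA Set 1, and users that UGD says are female form GA Set 2. For each set, GA reports how many users it says are male and how many it says are female.]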
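One way to read the diagram is as an agreement check: within each set, what share of users does GA place in the gender that UGD predicted? The sketch below assumes that reading. The function name and the counts are invented for illustration and are not results from the talk.

```python
def ga_agreement(ga_male, ga_female, predicted):
    """Share of a GA set whose GA gender matches the predicted gender."""
    total = ga_male + ga_female
    if total == 0:
        raise ValueError("empty GA set")
    matching = ga_male if predicted == "male" else ga_female
    return matching / total


# Hypothetical counts, for illustration only.
set_1 = ga_agreement(ga_male=700, ga_female=300, predicted="male")    # UGD says male
set_2 = ga_agreement(ga_male=250, ga_female=750, predicted="female")  # UGD says female
```

If the model carried no signal, both sets would show roughly the same GA gender split, so the gap between the two sets matters more than either number alone.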
20. Prediction Set Data Analysis
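• Because the prediction data is far larger than the training data, the overlap ratio is fairly high when the training set is used as the denominator
• With the prediction data as the denominator, however, the overlap ratios for Article, Author, Country, and Referer are all below 10%, as shown in the figure below
• For Article and Author, this is because Pixnet users' reading is concentrated on a small number of articles; the remaining articles get very few clicks, and some have never been viewed by anyone else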
[Figure: two diagrams showing the overlap between the prediction set and the training set, one for Article, Author, and Referrer and one for Hour and Category.]
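The two ways of measuring overlap differ only in the denominator, which a short sketch makes concrete. The helper name and the article ids below are made up for illustration.

```python
def overlap_ratios(training_values, prediction_values):
    """Return (shared / training size, shared / prediction size)."""
    training = set(training_values)
    prediction = set(prediction_values)
    if not training or not prediction:
        raise ValueError("both sets must be non-empty")
    shared = training & prediction
    return len(shared) / len(training), len(shared) / len(prediction)


# Illustrative ids: a small training set and a much larger prediction set.
training_articles = {"a1", "a2", "a3", "a4"}
prediction_articles = {"a3", "a4"} | {f"p{i}" for i in range(38)}

by_training, by_prediction = overlap_ratios(training_articles, prediction_articles)
```

Here half of the training articles also appear in the prediction set, yet they cover only 5% of it, which mirrors the pattern described in the bullets above.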
22. Implementation - Technology Inventory List
Technology Tool            Purpose
Scikit-learn               Machine learning library
Redis                      Cookie profile database
Python                     Programming language
Celery                     Scheduling framework
Redshift                   Data warehouse for large raw data
Django & Rest framework    Build API services for internal systems
23. Implementation - Performance Tuning
● CPU
o Batch Prediction: 1000x speed-up
o Parallel Processing: full use of multiple cores, 8x speed-up
o Python
● Memory
o Garbage Collection
o Python: del
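The three ideas on this slide can be sketched together: predict in batches rather than one cookie at a time, spread the batches over several processes, and release large intermediates explicitly. Everything here is a hypothetical illustration. `predict_batch` is a trivial stand-in for a real model call, and the speed-ups quoted on the slide come from the talk, not from this code.

```python
import gc
from multiprocessing import Pool


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_batch(rows):
    """Stand-in for one model call that scores a whole batch of rows."""
    return [1 if sum(row) % 2 else 2 for row in rows]


def predict_all(rows, batch_size=1000, workers=1):
    """Predict every row, one call per batch, optionally across processes."""
    batches = list(chunks(rows, batch_size))
    if workers > 1:
        with Pool(workers) as pool:
            per_batch = pool.map(predict_batch, batches)
    else:
        per_batch = [predict_batch(batch) for batch in batches]
    labels = [label for result in per_batch for label in result]
    # Drop the large intermediates explicitly, then ask the collector to run.
    del batches, per_batch
    gc.collect()
    return labels
```

A call such as predict_all(rows, batch_size=1000, workers=8) would use eight processes. On platforms that start workers by spawning a fresh interpreter, that call should sit under an `if __name__ == "__main__":` guard.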