PyData SF 2016 --- Moving forward through the darkness

Moving Forward Through
The Darkness
the blindness of modeling and
how to break through
Chia-Chi@PyData SF 2016

About Chia-Chi (George)
● Organizer of Taiwan R User Group and MLDM Monday
● 7 years of experience in quantitative trading in the futures & options markets
● 5 years of consulting experience in machine learning & data mining
● 4 years of experience in e-commerce (consulting & joining SaaS teams)
● 4 years of experience building recommendation and search engines
● Volunteer at PyCon APAC 2014 (program officer)
● Volunteer at PyCon APAC 2015 (program officer)
● I love Python and hope I can write Python every day!

Training models
from data
is just like sketching pictures
from the world

Jackson Pollock:
The painting has a life of its own.
I try to let it come through.

As a data scientist …
The data has a life of its own.
I just try to let it come through.

The first step …
is not picking up your pen!
is choosing an angle!
and doing some observation!

Now, the object you want to sketch ...
is your data!

Try to sketch it: y = a_0 + a_1 x

Which line is the most
similar one?
It depends on your observation angle!
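
A minimal sketch of how the chosen angle changes the line, using NumPy on made-up points. It assumes that one reading of "angle" is which error you minimize; that reading, the data, and the variable names are illustrative and not from the talk. Fitting y on x (vertical errors) and fitting x on y (horizontal errors) give two different lines through the same points.

```python
import numpy as np

# Made-up noisy points, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=2.0, size=200)

# Angle 1: minimize vertical errors (regress y on x).
# np.polyfit returns coefficients from highest degree to lowest.
a1_vertical, a0_vertical = np.polyfit(x, y, deg=1)

# Angle 2: minimize horizontal errors (regress x on y), then
# rewrite x = b0 + b1*y as y = a0 + a1*x so the slopes are comparable.
b1, b0 = np.polyfit(y, x, deg=1)
a1_horizontal = 1.0 / b1
a0_horizontal = -b0 / b1

print(f"vertical errors:   y = {a0_vertical:.2f} + {a1_vertical:.2f} x")
print(f"horizontal errors: y = {a0_horizontal:.2f} + {a1_horizontal:.2f} x")
```

Neither line is wrong; each one is the best answer to a different question.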

What does the angle mean ...
in a machine learning problem?

After choosing an angle ...
You have chosen the question & the evaluator ...
(as a data scientist)

How to change
the angle?
In the Linear Regression Problem

The Metaphor
Data (the object) +
Evaluator (point of view | angle)
=> Model (picture)

Different Angles ...
Different Pictures …
(in sketching)

Different questions ...
Different models …
(in data science)

Whatever you observe …
Whatever you draw !
(both in sketching and data science)

The two keys
Help you apply machine learning
in the real world

Can Learn ONLY
Through Real
Practice
Can Learn from
School or Practice

Modeling Procedures:
● Choose a Real Problem
● Collect Related Data
● Choose a method to convert Data to Vectors (or Tensors)
● Decompose Real Problem into several ML or Math Problems
● Solve each ML or Math Problem individually
● Combine the Solutions of all ML or Math Problems
● Check whether that truly solves the Real Problem

Case Study:
How to build
a Recommendation System in
News Platform

User-Centered Recommendation
News you
probably also
want to read

[Architecture diagram. Labels: Platform (Tracking Users' Behavior); Data Feed (News Data); Response (Results); Machine Learning; Server Group (Processing Data); Server Group (Prediction Data)]

Modeling Procedures -- Part 1:
● Problem: how to help users reach more of the news they want to read?
● Data:
○ News Data (Article)
■ Title
■ Text
■ Time
■ Category
○ User Behavior Data
■ User Views Post
■ User Clicks Link
■ User Does Not Click Link

● Data to Vectors (or Tensors)
○ News Data
■ TermDocumentMatrix (scikit-learn)
■ Word2Vec (gensim word2vec)
○ User Behavior Data
■ Event Sampling (Spark Streaming, Kafka, or TrailDB)
● construct user-item matrix (user view|click|not-click events)
● construct item-item matrix (view-after-view, click-after-view, ...)
● construct user-item-time tensor cube
● construct user-item-item-time tensor cube
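
The two matrix constructions above can be sketched as follows. This is a minimal illustration with NumPy, assuming a tiny hypothetical event log and hypothetical event weights; the item-item matrix here counts any consecutive pair of events per user, which is a simplification of "view-after-view" and "click-after-view".

```python
import numpy as np

# Hypothetical event log: (user, item, event), already ordered by time.
events = [
    ("u1", "n1", "view"), ("u1", "n2", "click"), ("u1", "n3", "view"),
    ("u2", "n1", "view"), ("u2", "n3", "view"),
    ("u3", "n2", "not-click"),
]
# Hypothetical event weights; the right values depend on the product.
weights = {"view": 1.0, "click": 2.0, "not-click": -1.0}

users = sorted({u for u, _, _ in events})
items = sorted({i for _, i, _ in events})
u_idx = {u: k for k, u in enumerate(users)}
i_idx = {i: k for k, i in enumerate(items)}

# User-item matrix: accumulated event weight per (user, item).
user_item = np.zeros((len(users), len(items)))
for u, i, e in events:
    user_item[u_idx[u], i_idx[i]] += weights[e]

# Item-item matrix: how often item b directly follows item a
# in the same user's event stream.
item_item = np.zeros((len(items), len(items)))
last_item = {}
for u, i, _ in events:
    if u in last_item:
        item_item[i_idx[last_item[u]], i_idx[i]] += 1
    last_item[u] = i

print(user_item)
print(item_item)
```

Adding a time-bucket index to the same loops would give the user-item-time cube; real logs would call for a sparse format rather than dense arrays.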

● Real to ML (or Math) and Solve ML (or Math) Problems
○ Real Problem: how to help users reach more of the news they want to read?
○ ML (or Math) Problems:
■ Hottest & Newest
■ Content-Based Relations
■ Collaborative Filtering

Content-Based Relations: Clustering
With the TermDocumentMatrix
Or the results coming from word2vec
Collaborative Filtering: MF & Matrix Completion

Use only 20% of the data to regenerate the full image!
Ref: ipynb@github
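
A minimal sketch of matrix completion by low-rank factorization, using alternating least squares in NumPy. It works on a synthetic rank-2 matrix as a stand-in, not the image from the slide or the referenced notebook; the sizes, rank, regularization, and function name are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic rank-2 "ground truth" (a stand-in for the full image).
n_rows, n_cols, rank = 100, 100, 2
M = rng.normal(size=(n_rows, rank)) @ rng.normal(size=(rank, n_cols))

# Reveal only about 20% of the entries.
mask = rng.random(M.shape) < 0.2

def als_complete(M, mask, rank, n_iter=50, lam=1e-2):
    """Fit M ~ U @ V.T on the observed entries only."""
    U = rng.normal(size=(M.shape[0], rank))
    V = rng.normal(size=(M.shape[1], rank))
    reg = lam * np.eye(rank)
    for _ in range(n_iter):
        for i in range(M.shape[0]):   # update each row factor
            obs = mask[i]
            U[i] = np.linalg.solve(V[obs].T @ V[obs] + reg,
                                   V[obs].T @ M[i, obs])
        for j in range(M.shape[1]):   # update each column factor
            obs = mask[:, j]
            V[j] = np.linalg.solve(U[obs].T @ U[obs] + reg,
                                   U[obs].T @ M[obs, j])
    return U @ V.T

M_hat = als_complete(M, mask, rank)

# Measure the error only on entries the model never saw.
hidden = ~mask
rel_err = np.linalg.norm((M_hat - M)[hidden]) / np.linalg.norm(M[hidden])
print(f"relative error on hidden entries: {rel_err:.3f}")
```

In collaborative filtering the matrix is the user-item matrix, and the hidden entries are the articles a user has not seen yet.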

● Combine the Solutions and Check them against the Real Problem
○ Ensemble Learning (for static combination)
○ Reinforcement Learning (for dynamic improvement)
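
A minimal sketch of the static combination only, assuming each component model already outputs a score per article. The scores, weights, and model names are hypothetical, and a weighted average is just one simple form of ensemble; the dynamic (reinforcement learning) side is not shown.

```python
# Hypothetical per-article scores in [0, 1] from three component models.
scores = {
    "hottest_newest": {"n1": 0.9, "n2": 0.2, "n3": 0.5},
    "content_based":  {"n1": 0.1, "n2": 0.8, "n3": 0.6},
    "collaborative":  {"n1": 0.3, "n2": 0.7, "n3": 0.9},
}
# Hypothetical fixed weights (they sum to 1); in practice they would be
# tuned against the real problem, e.g. on held-out click data.
weights = {"hottest_newest": 0.2, "content_based": 0.3, "collaborative": 0.5}

def blend(scores, weights):
    """Weighted average of the component scores for every article."""
    articles = next(iter(scores.values())).keys()
    return {
        a: sum(weights[m] * scores[m][a] for m in scores)
        for a in articles
    }

blended = blend(scores, weights)
ranking = sorted(blended, key=blended.get, reverse=True)
print(ranking)  # ['n3', 'n2', 'n1']
```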

Recap: Modeling Procedures:

The Blindness
in the Modeling Procedures

Blindness of Modeling Procedures:

Problem | Data
Problem-Driven: Thinking about Data through the Problem
Data-Driven: Thinking about the Problem through Data
Problem behind the Problem
Information behind the Data
Business Insights
The Blindness between
Data and Problem
Is there any relevant information in the data?
Can the problem be answered with this data?

Case Study: Bookstore
Could you use our POS data
to find ways to convert
the users who originally dislike us?

Case Study: Bookstore
Could you use our POS data
to find potential users?
"Potential" means users who want to buy
but haven't yet

Data from POS
ONLY has the information
about converted users
There is no information
about users who dislike us or have not converted

Thinking in Two Ways
Data-Driven
Problem-Driven

Problem: ??? | Data: POS Data
Problem-Driven: Thinking about Data through the Problem
Data-Driven: Thinking about the Problem through Data
Problem behind the Problem
Information behind the Data
Business Insights

Problem: Grow the Bookstore's Revenue? | Data: ???
Problem-Driven: Thinking about Data through the Problem
Data-Driven: Thinking about the Problem through Data
Problem behind the Problem
Information behind the Data
Business Insights

Case Study:
LBS Food Search
a story about
"delicious" is not delicious!
(this is also the blindness of NLP)

Data Thinking
First-Hand Versus Second-Hand
for example,
delicious versus "delicious"

A machine can NOT learn
by itself.
It is just like a child.
It learns from training data!
Sometimes it learns badly!

● Choose a method to convert Data to Vectors (or Tensors)

The Blindness From
Data to Vector
Is any information lost
when you convert your data?

Blindness of unigram
I love it (我愛它) = it loves me (它愛我)
I hate it (我恨它) = it hates me (它恨我)
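
A minimal sketch of this blindness using only the standard library, assuming each Chinese character is treated as one token. The helper names are illustrative.

```python
from collections import Counter

def unigrams(text):
    """Bag of single characters (one character = one token here)."""
    return Counter(text)

def bigrams(text):
    """Bag of adjacent character pairs, which keeps local order."""
    return Counter(zip(text, text[1:]))

a, b = "我愛它", "它愛我"
print(unigrams(a) == unigrams(b))  # True: unigrams cannot tell them apart
print(bigrams(a) == bigrams(b))    # False: bigrams can
```

The information about who loves whom is lost at the moment the sentence becomes a unigram vector, before any model is trained.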

The Blindness From
Mathematical Concept
The gap between math and real world:
When putting the units back into the formula …

Math in Elementary School …
The secret behind the minus operator
● 103 - 100 = 6 - 3 ?
● 103 dollars - 100 dollars = 6 dollars - 3 dollars ?
● 103-dollar stock - 100-dollar stock = 6-dollar stock - 3-dollar stock?
(formula)(units)
● (103 - 100 = 6 - 3) (dollars)
● (103 - 100 = 6 - 3) (dollar stock)
How to choose the right coordinate for a stock price?
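
The subtraction above can be checked numerically. A minimal sketch, assuming relative change and log change as candidate coordinates; these are common choices in finance, not something the slide states, and the function names are illustrative.

```python
import math

def difference(p_new, p_old):
    """Change in plain dollars."""
    return p_new - p_old

def simple_return(p_new, p_old):
    """Change relative to the starting price."""
    return (p_new - p_old) / p_old

def log_return(p_new, p_old):
    """Change on a logarithmic price axis."""
    return math.log(p_new / p_old)

for p_old, p_new in [(100, 103), (3, 6)]:
    print(
        f"{p_old} -> {p_new}: "
        f"difference={difference(p_new, p_old)}, "
        f"simple_return={simple_return(p_new, p_old):.2%}, "
        f"log_return={log_return(p_new, p_old):.4f}"
    )
```

Both moves are 3 dollars, yet one is a 3% move and the other doubles the price; the two are equal only in the dollar coordinate.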

● Decompose Real Problem into several ML or Math Problems

The Blindness From
ML Frameworks
Classification & Clustering

When an orange-apple classifier meets a banana?

When a digit classifier meets a letter?
A -> 9
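
A minimal sketch of why this happens, assuming a nearest-centroid classifier over two hypothetical features. The features, centroid values, and names are made up for illustration; any classifier that must pick from a fixed label set behaves the same way.

```python
import math

# Hypothetical 2-D features: (redness, roundness), both in [0, 1].
centroids = {"apple": (0.9, 0.9), "orange": (0.5, 0.95)}

def classify(x):
    """Closed-set classifier: always answers with a known class."""
    return min(centroids, key=lambda c: math.dist(x, centroids[c]))

banana = (0.2, 0.1)  # neither red nor round
label = classify(banana)
distance = math.dist(banana, centroids[label])

# The banana is far from both classes, but the classifier still
# has to name one of them.
print(label, round(distance, 2))
```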

The blindness of clustering methods

You cannot force two points into the same cluster

The fact is …
We always get some data with labels
But some without
(1) How to propagate labels?
(2) How to detect new labels with labelers?

New Data & New Labels
keep arriving all the time
In e-commerce retailers &
In news platforms

What we need …
(1) A Classifier that works like a Clustering Method
(one-versus-all incremental classifier)
(2) A Clustering Method that works like a Classifier
(Metric Learning)

one-versus-all incremental classifier
Not Class 1
Not Class 2
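
A minimal sketch of the idea, assuming each class gets its own independent "class k versus not class k" detector. Here a detector is a centroid plus a radius, which is a stand-in chosen for brevity and not the method from the talk; the class name, `add_class`, `predict`, and the `margin` parameter are all hypothetical.

```python
import math

class OneVsAllIncremental:
    """One independent 'class k vs. not class k' detector per label."""

    def __init__(self):
        self.detectors = {}  # label -> (centroid, radius)

    def add_class(self, label, examples, margin=1.5):
        # Fit only the new detector; existing ones are left untouched.
        dim = len(examples[0])
        centroid = tuple(sum(e[d] for e in examples) / len(examples)
                         for d in range(dim))
        radius = margin * max(math.dist(e, centroid) for e in examples)
        self.detectors[label] = (centroid, radius)

    def predict(self, x):
        # Collect every detector that says "yes, this is my class".
        hits = [(math.dist(x, c), label)
                for label, (c, r) in self.detectors.items()
                if math.dist(x, c) <= r]
        # No detector fires: the point is a candidate for a new label.
        return min(hits)[1] if hits else None

clf = OneVsAllIncremental()
clf.add_class("apple", [(0.9, 0.9), (0.8, 0.85)])
clf.add_class("orange", [(0.5, 0.95), (0.55, 0.9)])

banana = (0.2, 0.1)
print(clf.predict(banana))  # None: rejected by every detector
clf.add_class("banana", [(0.2, 0.1), (0.25, 0.15)])
print(clf.predict(banana))  # banana
```

Points that every detector rejects can be routed to human labelers, and a new class is added without retraining the old ones.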

Actually …
You can also use a deep neural network
to construct the metric learning stuff

Metric Learning
Always gives me a whole new angle
to observe the world

Remember that ... !
Whatever you observe …
Whatever you draw !
