This document discusses various types of "blindness" that can occur when applying machine learning modeling procedures and techniques. It notes that modeling procedures often focus on decomposing problems and data in a way that can lose important connections or information. Specific issues highlighted include the gap between problems and available data, information loss when converting data to vectors, disconnects between mathematical concepts and real-world applications, limitations of individual ML techniques, and challenges with new data and labels. The document advocates thinking more from both data-driven and problem-driven perspectives, and considering alternative techniques that can bridge gaps, such as metric learning and one-versus-all classifiers.
PyData SF 2016 --- Moving forward through the darkness
1. Moving Forward Through the Darkness
The blindness of modeling and how to break through
Chia-Chi @ PyData SF 2016
3. About Chia-Chi (George)
● Organizer of Taiwan R User Group and MLDM Monday
● 7 years of experience in quantitative trading in futures & options markets
● 5 years of consulting experience in machine learning & data mining
● 4 years of experience in e-commerce (as a consultant and as a member of SaaS teams)
● 4 years of experience building recommendation and search engines
● Volunteer at PyCon APAC 2014 (program officer)
● Volunteer at PyCon APAC 2015 (program officer)
● I love Python and hope I can write Python every day!
32. Modeling Procedures:
● Choose a Real Problem
● Collect Related Data
● Choose a method to convert Data to Vectors (or Tensors)
● Decompose the Real Problem into several ML or Math Problems
● Solve each ML or Math Problem individually
● Combine the Solutions of all ML or Math Problems
● Check whether that truly solves the Real Problem
36. Modeling Procedures -- Part 1:
● Problem: how can we help users reach more of the news they want to read?
● Data:
○ News Data (Article)
■ Title
■ Text
■ Time
■ Category
○ User Behavior Data
■ User views a post
■ User clicks a link
■ User does not click a link
37. Modeling Procedures -- Part 2:
● Data to Vector (or Tensors)
○ News Data
■ TermDocumentMatrix (scikit-learn)
■ Word2Vec (gensim word2vec)
○ User Behavior Data
■ Event Sampling (Spark streaming, Kafka, or Traildb)
● construct user-item matrix (user view|click|not-click events)
● construct item-item matrix (view-after-view, click-after-view, ...)
● construct user-item-time tensor cube
● construct user-item-item-time tensor cube
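A minimal sketch of the user-behavior step above, using only the Python standard library. The event log, the item and user names, and the per-event weights are all hypothetical and chosen for illustration; a production pipeline would read events from a stream and likely use sparse matrices instead of nested dictionaries.

```python
from collections import defaultdict

# Hypothetical event log: (user, item, event, timestamp)
events = [
    ("u1", "n1", "view", 1),
    ("u1", "n2", "click", 2),
    ("u2", "n1", "view", 1),
    ("u2", "n3", "not-click", 2),
    ("u2", "n2", "view", 3),
]

# Assumed weights per event type (not from the talk)
EVENT_WEIGHT = {"view": 1.0, "click": 2.0, "not-click": -1.0}

# user-item matrix: accumulated event weight per (user, item)
user_item = defaultdict(lambda: defaultdict(float))
for user, item, event, _ts in events:
    user_item[user][item] += EVENT_WEIGHT[event]

# item-item matrix: how often item b was viewed or clicked right after item a
by_user = defaultdict(list)
for user, item, event, ts in events:
    if event != "not-click":
        by_user[user].append((ts, item))

item_item = defaultdict(lambda: defaultdict(int))
for seq in by_user.values():
    seq.sort()  # order each user's events by timestamp
    for (_, a), (_, b) in zip(seq, seq[1:]):
        item_item[a][b] += 1
```

Keeping the timestamp in each event is what would allow the same log to be binned into the user-item-time cube mentioned above.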
38. Modeling Procedures -- Part 3:
● Map the Real Problem to ML (or Math) Problems and Solve Them
○ Real Problem: how can we help users reach more of the news they want to read?
○ ML (or Math) Problems:
■ Hottest & Newest
■ Content-Based Relations
■ Collaborative Filtering
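As one illustration of the third sub-problem, here is a small item-based collaborative filtering sketch using cosine similarity between item columns. The matrix values and names are hypothetical, and this is only one of several common formulations, not necessarily the one used in the talk.

```python
import math

# Hypothetical user-item matrix: 1 means the user viewed the news item
items = ["n1", "n2", "n3"]
ratings = {
    "u1": [1, 1, 0],
    "u2": [1, 1, 0],
    "u3": [0, 0, 1],
}

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# One column per item: who interacted with it
columns = {item: [row[j] for row in ratings.values()] for j, item in enumerate(items)}

# Pairwise item-item similarity
sim = {a: {b: cosine(columns[a], columns[b]) for b in items if b != a} for a in items}

def most_similar(item):
    """Return the item whose audience overlaps most with the given item."""
    return max(sim[item], key=sim[item].get)
```

Here `most_similar("n1")` picks `n2`, because the same users viewed both, while `n3` shares no audience with `n1`.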
45. Use only 20% of the data to regenerate the full image!
Ref: ipynb@github
46. Modeling Procedures -- Part 4:
● Combine the Solutions and Check the Result against the Real Problem
○ Ensemble Learning (for static combination)
○ Reinforcement Learning (for dynamic improvement)
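A sketch of the static combination only: a fixed weighted blend of the scores from the three sub-models. The scores and weights below are hypothetical; in practice the weights would be fitted on held-out data, and the dynamic (reinforcement learning) variant would adjust them from user feedback.

```python
# Hypothetical per-item scores from three sub-models, each scaled to [0, 1]
scores = {
    "hot_new":       {"n1": 0.9, "n2": 0.2, "n3": 0.5},
    "content_based": {"n1": 0.1, "n2": 0.8, "n3": 0.4},
    "collaborative": {"n1": 0.3, "n2": 0.9, "n3": 0.2},
}

# Assumed fixed blending weights (not from the talk)
weights = {"hot_new": 0.2, "content_based": 0.3, "collaborative": 0.5}

def blend(scores, weights):
    """Weighted sum of model scores per item; a missing score counts as 0."""
    all_items = {item for model_scores in scores.values() for item in model_scores}
    return {
        item: sum(weights[m] * scores[m].get(item, 0.0) for m in scores)
        for item in all_items
    }

# Items ordered from highest to lowest blended score
ranking = sorted(blend(scores, weights).items(), key=lambda kv: kv[1], reverse=True)
```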
50. Recap: Modeling Procedures:
● Choose a Real Problem
● Collect Related Data
● Choose a method to convert Data to Vectors (or Tensors)
● Decompose the Real Problem into several ML or Math Problems
● Solve each ML or Math Problem individually
● Combine the Solutions of all ML or Math Problems
● Check whether that truly solves the Real Problem
52. Blindness of Modeling Procedures:
● Choose a Real Problem
● Collect Related Data
● Choose a method to convert Data to Vectors (or Tensors)
● Decompose the Real Problem into several ML or Math Problems
● Solve each ML or Math Problem individually
● Combine the Solutions of all ML or Math Problems
● Check whether that truly solves the Real Problem
65. A machine can NOT learn by itself.
It is just like a child.
It learns from training data!
Sometimes it learns badly!
66. Blindness of Modeling Procedures:
● Choose a Real Problem
● Collect Related Data
● Choose a method to convert Data to Vectors (or Tensors)
● Decompose the Real Problem into several ML or Math Problems
● Solve each ML or Math Problem individually
● Combine the Solutions of all ML or Math Problems
● Check whether that truly solves the Real Problem
67. The Blindness From Data to Vector
Is any information lost when you convert your data?
68. Blindness of unigram
I love it (我愛它) = it love me (它愛我)
I hate it (我恨它) = it hate me (它恨我)
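The point above can be checked directly: with character-level unigram counts, the two Chinese sentences map to the same vector, so word order is lost. The sketch below uses only the standard library; the helper name `ngrams` is an illustrative choice. Bigrams are shown as one possible way to keep some order information.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

a = list("我愛它")  # "I love it", one token per character
b = list("它愛我")  # "it loves me"

same_unigrams = ngrams(a, 1) == ngrams(b, 1)  # same bag of characters
same_bigrams = ngrams(a, 2) == ngrams(b, 2)   # adjacent pairs differ
```

Here `same_unigrams` is `True` and `same_bigrams` is `False`: the unigram representation cannot tell the two sentences apart, while the bigram one can.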
70. Math in Elementary School …
The secret behind the minus operator
● 103 - 100 = 6 - 3 ?
● 103 dollars - 100 dollars = 6 dollars - 3 dollars ?
● 103 dollars stock - 100 dollars stock = 6 dollars stock - 3 dollars stock ?
(formula)(units)
● (103 - 100 = 6 - 3) (dollars)
● (103 - 100 = 6 - 3) (dollars stock)
How to choose the right coordinate for stock price ?
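One way to read the question above: the two differences are equal as plain numbers, but a move from 100 to 103 and a move from 3 to 6 mean very different things for a stock. The sketch below compares the absolute change with the relative change and the log return. Treating relative or log scale as the "right coordinate" is an assumption here, a common convention in finance rather than something the slide states.

```python
import math

def abs_change(p0, p1):
    """Difference in price units (dollars)."""
    return p1 - p0

def rel_change(p0, p1):
    """Change as a fraction of the starting price."""
    return (p1 - p0) / p0

def log_return(p0, p1):
    """Difference in the log-price coordinate."""
    return math.log(p1 / p0)

# Same subtraction result, very different moves for a stock holder
same_abs = abs_change(100, 103) == abs_change(3, 6)
small_move = rel_change(100, 103)  # about 0.03, a 3% gain
large_move = rel_change(3, 6)      # 1.0, the price doubled
```

In the log-price coordinate, subtraction again means the same thing everywhere: equal log returns correspond to equal percentage moves.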
71. Blindness of Modeling Procedures:
● Choose a Real Problem
● Collect Related Data
● Choose a method to convert Data to Vectors (or Tensors)
● Decompose the Real Problem into several ML or Math Problems
● Solve each ML or Math Problem individually
● Combine the Solutions of all ML or Math Problems
● Check whether that truly solves the Real Problem
77. The fact is …
We always get some data with labels
But some without
(1) How to propagate labels?
(2) How to detect new labels with labelers?
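A toy sketch of question (1), under strong assumptions: points are one-dimensional, and an unlabeled point takes the label of its nearest labeled neighbor if that neighbor lies within a fixed radius. The data, the radius, and the function name are hypothetical. Points that stay unlabeled are candidates for question (2), since they may belong to a label nobody has defined yet and could be sent to labelers.

```python
# Hypothetical 1-D feature values; None marks an unlabeled point
points = [0.0, 0.5, 1.0, 5.0, 9.0, 9.5, 10.0]
labels = ["sports", None, None, None, None, None, "finance"]

def propagate(points, labels, radius=1.0):
    """Repeatedly copy labels from the nearest labeled neighbor within radius."""
    labels = list(labels)  # work on a copy
    changed = True
    while changed:
        changed = False
        for i, label in enumerate(labels):
            if label is not None:
                continue
            best = None  # (distance, label) of the nearest labeled neighbor
            for j, other in enumerate(labels):
                if other is None:
                    continue
                d = abs(points[i] - points[j])
                if d <= radius and (best is None or d < best[0]):
                    best = (d, other)
            if best is not None:
                labels[i] = best[1]
                changed = True
    return labels

result = propagate(points, labels)
unlabeled = [p for p, lab in zip(points, result) if lab is None]
```

With this data the labels spread along each chain of nearby points, and the isolated point at 5.0 remains unlabeled.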
78. New Data & New Labels
keep coming all the time
in e-commerce retailers &
on news platforms
79. What we need …
(1) A classifier that works like a clustering method
(one-versus-all incremental classifier)
(2) A clustering method that works like a classifier
(metric learning)
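One possible reading of item (1), sketched under assumptions: each label keeps its own running centroid and its own accept/reject rule (distance to the centroid within a threshold), so a new label can be added without retraining the others, and a point rejected by every label is flagged as potentially new. The class name, threshold, and data are hypothetical. The Euclidean distance is a placeholder for where a learned metric from item (2) would plug in.

```python
import math

class IncrementalCentroidClassifier:
    """Toy per-label centroid model that supports adding labels on the fly."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed maximum distance to accept a label
        self.centroids = {}         # label -> (mean vector, sample count)

    def partial_fit(self, x, label):
        """Update (or create) the running mean for one label."""
        if label not in self.centroids:
            self.centroids[label] = (list(x), 1)
            return
        mean, n = self.centroids[label]
        n += 1
        mean = [m + (xi - m) / n for m, xi in zip(mean, x)]
        self.centroids[label] = (mean, n)

    def predict(self, x):
        """Return the closest accepted label, or None if every label rejects x."""
        best_label, best_dist = None, math.inf
        for label, (mean, _count) in self.centroids.items():
            d = math.dist(x, mean)  # placeholder for a learned metric
            if d < best_dist:
                best_label, best_dist = label, d
        return best_label if best_dist <= self.threshold else None

clf = IncrementalCentroidClassifier(threshold=1.5)
clf.partial_fit((0.0, 0.0), "sports")
clf.partial_fit((0.0, 1.0), "sports")
clf.partial_fit((10.0, 10.0), "finance")

known = clf.predict((0.2, 0.4))    # close to the "sports" centroid
unknown = clf.predict((5.0, 5.0))  # far from every centroid

clf.partial_fit((5.0, 5.0), "weather")  # a labeler introduces a new label
later = clf.predict((5.1, 4.9))
```

After the new label is added, nearby points are assigned to it while the existing centroids are left unchanged.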