Summer Internship
Final Report
Naoki Ishikawa (@NeokiStones)
2015/09/30 13:30-
Who am I
2
• Naoki Ishikawa
• Waseda University, Information Science M1
• Research: Evolutionary Computation / Reinforcement Learning
• Laboratory: Sugawara Lab
• Laboratory theme: Artificial Intelligence
• Implemented Algorithm
• Factorization Machine
• Latent Dirichlet Allocation
3
Table of contents
Factorization Machine
5
• Algorithm for Recommendation
• Classification
• Regression
• Supervised Learning
• Need Input/Output Data
• Suitable for Sparse Data
Application
7
• Prediction of Movie Rating
• Task: predict a movie rating (a real number)
• Regression
- Input: Self-designed Matrix
- Output: Rating Vector
8
Input Output
Prediction of Movie Rating
INPUT Details
9
• Identifier

- User Identifier : [0, 0, …, 0, 1, 0, …,0]

- Movie Identifier : [0, 0, …, 0, 0, 1, 0, …,0]
• Designed Feature

- Rating of Other Movie

- Time

- Last Movie rated
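The input layout above (one-hot user and movie identifiers followed by designed features) can be sketched as a sparse vector. This is a minimal illustration only: `build_features`, the index layout, and the example values are hypothetical and are not taken from the slides or from Hivemall.

```python
def build_features(user_id, movie_id, n_users, n_movies, extra=None):
    """Return a sparse feature vector as {index: value}.

    Assumed layout (hypothetical): [user one-hot | movie one-hot | designed features].
    Only non-zero entries are stored, which suits sparse data.
    """
    x = {user_id: 1.0, n_users + movie_id: 1.0}
    offset = n_users + n_movies
    for i, value in enumerate(extra or []):
        if value != 0.0:
            x[offset + i] = float(value)
    return x


# User 2 of 4, movie 1 of 3, plus three designed features (one of them zero).
x = build_features(user_id=2, movie_id=1, n_users=4, n_movies=3,
                   extra=[0.5, 0.0, 4.0])
```

Storing only the non-zero indices keeps the cost of each prediction proportional to the number of active features rather than to the full width of the matrix.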
10
Recommendation Algorithm
• Collaborative Filtering
• Associations Analysis
• Bayesian Network
Prediction of Movie Rating
11
• Hivemall
• Matrix Factorization
• Recommendation
12
Difference from Matrix Factorization
• Data Structure
• Matrix Factorization
• User-Item Matrix
http://ampcamp.berkeley.edu/big-data-mini-course/img/matrix_factorization.png
Input
Learning Parameter
13
Difference from Matrix Factorization
• Factorization Machine
[Figure: input matrix and learning parameters V and W]
14
• Factorization Machine
• Consider
• context data
• Interaction between variables
Advantage of Factorization Machine
15
Difference from Matrix Factorization
Prediction by Factorization Machine (d=2)
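16
Difference from Matrix Factorization
Prediction by Factorization Machine (d=2)
$$\hat{y}(x) = w_0 + \sum_{k=1}^{n} w_k x_k + \sum_{k=1}^{n} \sum_{j=k+1}^{n} \langle \mathbf{v}_k, \mathbf{v}_j \rangle \, x_k x_j$$
• $w_0$: global bias (mean)
• $w_k$: regression coefficient of the k-th variable
• $\langle \mathbf{v}_k, \mathbf{v}_j \rangle$: factorization of the interaction weight ($W_{kj}$)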
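The d = 2 prediction combines a global bias, per-variable regression coefficients, and factorized pairwise interactions. The sketch below is one way to compute it, assuming a sparse input `{index: value}` and plain Python lists for the parameters; the function names are hypothetical and this is not the code written during the internship. It uses the well-known rewriting of the pairwise term so the cost is linear in the number of non-zero features, and checks it against a direct double loop.

```python
def fm_predict(x, w0, w, V):
    """FM prediction (d = 2) for sparse x = {index: value}.

    w0: global bias, w: linear weights, V: one K-dimensional factor list per variable.
    """
    k = len(V[0])
    linear = sum(w[i] * xi for i, xi in x.items())
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * xi for i, xi in x.items())
        sq = sum((V[i][f] * xi) ** 2 for i, xi in x.items())
        pairwise += 0.5 * (s * s - sq)  # equals the sum over pairs i < j
    return w0 + linear + pairwise


def fm_predict_naive(x, w0, w, V):
    """Same prediction with an explicit loop over all pairs i < j."""
    idx = sorted(x)
    total = w0 + sum(w[i] * x[i] for i in idx)
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            i, j = idx[a], idx[b]
            dot = sum(vi * vj for vi, vj in zip(V[i], V[j]))
            total += dot * x[i] * x[j]
    return total


w0 = 0.5
w = [0.1, 0.2, 0.3]
V = [[1.0, 0.5], [0.5, 0.5], [2.0, 1.0]]
x = {0: 1.0, 2: 2.0}
y_fast = fm_predict(x, w0, w, V)
y_naive = fm_predict_naive(x, w0, w, V)
```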
17
Difference from Matrix Factorization
Prediction by Factorization Machine (d=2)
Learning Method: Stochastic Gradient Descent (SGD)
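A single SGD step for a d = 2 Factorization Machine might look like the following. It is a sketch under assumptions the slides do not state: squared-error loss, an L2 penalty `reg` on the weights and factors, and a learning rate `lr`. The names `predict` and `sgd_step` are hypothetical.

```python
def predict(x, w0, w, V):
    """FM prediction (d = 2) for sparse x = {index: value}."""
    k = len(V[0])
    sums = [sum(V[i][f] * xi for i, xi in x.items()) for f in range(k)]
    sqs = [sum((V[i][f] * xi) ** 2 for i, xi in x.items()) for f in range(k)]
    linear = sum(w[i] * xi for i, xi in x.items())
    return w0 + linear + 0.5 * sum(s * s - q for s, q in zip(sums, sqs))


def sgd_step(x, y, w0, w, V, lr=0.01, reg=0.0):
    """Update w and V in place for one example; return the new bias."""
    k = len(V[0])
    err = predict(x, w0, w, V) - y  # gradient of 0.5 * (pred - y)^2 w.r.t. pred
    sums = [sum(V[i][f] * xi for i, xi in x.items()) for f in range(k)]
    w0 -= lr * err  # bias is left unregularized here
    for i, xi in x.items():
        w[i] -= lr * (err * xi + reg * w[i])
        for f in range(k):
            grad = xi * sums[f] - V[i][f] * xi * xi  # d(pred) / d(V[i][f])
            V[i][f] -= lr * (err * grad + reg * V[i][f])
    return w0


# Fit a single toy example repeatedly and watch the error shrink.
w0, w, V = 0.0, [0.0] * 3, [[0.1, 0.1] for _ in range(3)]
x, y = {0: 1.0, 2: 1.0}, 3.0
initial_error = abs(predict(x, w0, w, V) - y)
for _ in range(200):
    w0 = sgd_step(x, y, w0, w, V, lr=0.05)
final_error = abs(predict(x, w0, w, V) - y)
```

Only the parameters of non-zero features are touched in each step, which is what makes SGD practical on sparse inputs.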
18
Local Implementation
19
Difference from Matrix Factorization
• d-way
• FM / MF
• assume K latent attributes
• Matrix Factorization: d = 2
• Factorization Machine: d ≥ 2
20
HyperParameter
• K: the number of hidden factors
• η: the regularization parameter
21
Implemented Model
• Implemented Model
• d = 2
• MapModel
• ArrayModel
22
Implemented Model
• MapModel
• For unknown data
• Flexible
• Suitable for Online Learning
23
Implemented Model
• ArrayModel
• For known data
• less overhead
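The trade-off between the two models can be illustrated with a small sketch. The classes below reuse the names `MapModel` and `ArrayModel` from the slides purely for illustration; their fields and methods are assumptions and do not reflect the actual Hivemall classes.

```python
class MapModel:
    """Weights keyed by feature name; unseen features are accepted at any time."""

    def __init__(self, init=0.0):
        self.weights = {}
        self.init = init

    def get(self, feature):
        return self.weights.get(feature, self.init)

    def set(self, feature, value):
        self.weights[feature] = value


class ArrayModel:
    """Weights in a fixed-size list; the feature count must be known up front."""

    def __init__(self, n_features, init=0.0):
        self.weights = [init] * n_features

    def get(self, index):
        return self.weights[index]

    def set(self, index, value):
        self.weights[index] = value


m = MapModel()
m.set("user:42", 0.3)      # a feature never declared beforehand
a = ArrayModel(n_features=3)
a.set(1, 0.5)              # index must be below n_features
```

The map variant pays for hashing on every access but needs no feature dictionary, which is why it fits online learning; the array variant avoids that overhead when the feature space is fixed.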
24
Other Use Case
• E-Commerce User-Item Recommendation
• Input Data
• Age
• Purchase timezone
• Past bought items
• Cluster ID
• Target Data
• Evaluation of an Item by User
• Implemented Algorithm
• Factorization Machine
• Latent Dirichlet Allocation
25
Table of contents
Latent Dirichlet Allocation
26
• Most popular topic-model algorithm
• Mostly applied to text data
• Find hidden structure of data
• Unsupervised Learning
• Need Input Data only
• Generative Model
Latent Dirichlet Allocation
27
• Generative Modelling in LDA
• Mimics how a document is generated
• 1. Choose what you write about (a topic)
• 2. Choose a word from the topic
• 3. Write it
Latent Dirichlet Allocation
28
• Input
• Text data (Documents)
• Output
• Topic-word distribution
• Document-Topic distribution
Latent Dirichlet Allocation
29
https://www.vappingo.com/word-blog/wp-content/uploads/2011/01/paper2.jpg
https://wellecks.wordpress.com/2014/10/26/ldaoverflow-with-online-lda/
Learning Method
30
• Define Generative model
• For documents
• Learn parameters to reproduce the document
Learning Method
31
K
Topic
Learning Method
32
http://heartruptcy.blog.fc2.com/blog-entry-124.html
Graphical Model(Code)
33
• For Topic k = {1, …, K}
    WordDistribution[k] ~ Dir(β)
  For Document d = {1, …, D}
    TopicDistribution[d] ~ Dir(α)
    For Word n = {1, …, numOfWord[d]}
      WordTopic[d][n] ~ TopicDistribution[d]
      Word[d][n] ~ WordDistribution[WordTopic[d][n]]
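The generative process above can be run forward to produce a toy corpus. The sketch below uses only the Python standard library; a Dirichlet draw is obtained by normalizing independent Gamma draws. The function names, the default `alpha` and `beta` values, and the example vocabulary are assumptions chosen for illustration.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma(concentration, 1) draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_topics, vocab, doc_lengths, alpha=0.5, beta=0.5, seed=0):
    """Follow the LDA generative process and return a list of documents."""
    rng = random.Random(seed)
    # One word distribution per topic: WordDistribution[k] ~ Dir(beta)
    word_dist = [sample_dirichlet(beta, len(vocab), rng) for _ in range(num_topics)]
    docs = []
    for length in doc_lengths:
        # TopicDistribution[d] ~ Dir(alpha)
        topic_dist = sample_dirichlet(alpha, num_topics, rng)
        words = []
        for _ in range(length):
            z = rng.choices(range(num_topics), weights=topic_dist)[0]  # WordTopic
            words.append(rng.choices(vocab, weights=word_dist[z])[0])  # Word
        docs.append(words)
    return docs


vocab = ["game", "team", "windows", "file"]
docs = generate_corpus(num_topics=2, vocab=vocab, doc_lengths=[4, 6], seed=1)
theta = sample_dirichlet(0.5, 3, random.Random(0))
```

Learning LDA is the reverse problem: given only `docs`, recover distributions that could plausibly have generated them.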
Learning Method
35
• Variational Bayes
• Gibbs Sampling (MCMC)
• Particle Filtering
faster than Gibbs Sampling
Mini-batch Online LDA
36
• Faster than Batch Algorithm
• Less noise than pure Online LDA
[Figure: batch-size axis, from Pure Online through Mini-batch Online to Batch]
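One common way to realize mini-batch online learning for LDA is to blend the current topic-word parameter lambda with an estimate computed from the latest mini-batch, using a step size that decays over time. The sketch below assumes the schedule rho_t = (tau0 + t)^(-kappa) from the online variational Bayes literature; the slides do not confirm which schedule or constants were used, and the function names are hypothetical.

```python
def learning_rate(t, tau0=1.0, kappa=0.7):
    """Decaying step size for update number t (t = 0, 1, 2, ...)."""
    return (tau0 + t) ** (-kappa)


def blend(lam, lam_hat, rho):
    """Move lambda toward the mini-batch estimate lam_hat by a fraction rho."""
    return [(1.0 - rho) * old + rho * new for old, new in zip(lam, lam_hat)]


lam = [1.0, 1.0]
lam = blend(lam, lam_hat=[3.0, 5.0], rho=0.5)
rates = [learning_rate(t) for t in range(4)]
```

A larger mini-batch makes each `lam_hat` less noisy, while a smaller one gives more frequent updates; pure online learning and batch learning sit at the two ends of that range.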
37
Implemented Model
• Mini-Batch Map Model
• For unknown data
• Doesn't assume a vocabulary list
• Mini-Batch Array Model (other implementation)
• For known data
• Assumes a vocabulary list
• Meaningless words
• LDA: clusters words by co-occurrence
• a, the, I, He, is, in, on
• Stop words: ignore them
• TF-IDF: how important a word is to a document in a collection or dataset
39
Faced Implementation Problem
• TF-IDF
• can be calculated by Hivemall
• Input Data: (DocId, Words)
• https://github.com/myui/hivemall/wiki/TFIDF-calculation
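For readers without a Hive environment, the idea of TF-IDF over (DocId, Words) input can be sketched in plain Python. The weighting below, relative term frequency times (log10(N / df) + 1), is one common variant chosen for illustration; it is not claimed to match Hivemall's formula exactly, and `tfidf` is a hypothetical helper.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: {doc_id: [word, ...]} -> {doc_id: {word: score}}."""
    n_docs = len(docs)
    df = Counter()  # number of documents containing each word
    for words in docs.values():
        df.update(set(words))
    scores = {}
    for doc_id, words in docs.items():
        counts = Counter(words)
        total = len(words)
        scores[doc_id] = {
            word: (count / total) * (math.log10(n_docs / df[word]) + 1.0)
            for word, count in counts.items()
        }
    return scores


docs = {1: ["justice", "law", "law"], 2: ["law", "game"]}
scores = tfidf(docs)
```

Words that occur in every document, such as "law" here, receive no boost from the inverse-document-frequency factor, so words specific to one document rank higher at equal frequency.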
41
Faced Implementation Problem
• 1 ["justice:0.1641245850805637","found:0.06564983513276658","discussion:0.06564983513276658","law:0.06564983513276658","based:0.06564983513276658","religion:0.06564983513276658","viewpoints:0.03282491756638329","rationality:0.03282491756638329","including:0.03282491756638329","context:0.03282491756638329","concept:0.03282491756638329","rightness:0.03282491756638329","general:0.03282491756638329","many:0.03282491756638329","differing:0.03282491756638329","fairness:0.03282491756638329","social:0.03282491756638329","broadest:0.03282491756638329","equity:0.03282491756638329","includes:0.03282491756638329","theology:0.03282491756638329"]
42
Faced Implementation Problem
• TF-IDF
• Vocabulary List Model
• Initialize lambda for all words at the start
• If a word does not appear in the document:
• Lambda decreases at the same rate
• No initialization problem
43
Faced Implementation Problem
• Online Map Model
• Initialize lambda when a new word is fetched
• Final lambda depends on when the word first appeared
• Initialization problem
44
Faced Implementation Problem
• Prepare dummy lambdas
• Initialize dummy lambdas at the start
• Apply the lambda update rule to the dummy lambdas
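One possible reading of the dummy-lambda idea is sketched below for a single topic. The assumptions are loud ones: every word starts from the same initial value, an absent word is updated toward the prior `eta` alone, and a newly fetched word copies the dummy's current value. None of this is confirmed by the slides, and `MapLambda` and `full_update` are hypothetical names.

```python
class MapLambda:
    """Lambda for one topic, stored in a dict plus one dummy entry."""

    def __init__(self, eta, init):
        self.eta = eta
        self.dummy = init   # stands in for every word not seen yet
        self.words = {}

    def update(self, rho, stats):
        """stats: {word: sufficient statistic from the current mini-batch}."""
        for word in stats:
            if word not in self.words:
                self.words[word] = self.dummy  # start from the decayed dummy
        for word in self.words:
            target = self.eta + stats.get(word, 0.0)
            self.words[word] = (1.0 - rho) * self.words[word] + rho * target
        # The dummy follows the same rule as a word absent from the batch.
        self.dummy = (1.0 - rho) * self.dummy + rho * self.eta


def full_update(lam, eta, rho, stats):
    """Reference: vocabulary-list model where every word is tracked from the start."""
    return {w: (1.0 - rho) * v + rho * (eta + stats.get(w, 0.0))
            for w, v in lam.items()}


eta, init = 0.01, 1.0
steps = [(0.5, {"game": 2.0}), (0.25, {"team": 1.0})]
online = MapLambda(eta, init)
full = {"game": init, "team": init}
for rho, stats in steps:
    online.update(rho, stats)
    full = full_update(full, eta, rho, stats)
```

Under these assumptions the word "team", first seen in the second mini-batch, ends with the same lambda as in the vocabulary-list model, so the result no longer depends on when a word first appeared.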
45
Faced Implementation Problem
• Implicit Φ Normalization
• Not written explicitly
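In the usual variational formulation of LDA, the update for Φ is stated only up to proportionality, so the normalization over topics has to be added by the implementer. A minimal sketch of that step follows, assuming the unnormalized log-weights for one word across the K topics are already available; `normalize_phi` is a hypothetical helper, and subtracting the maximum is a standard trick to avoid underflow.

```python
import math


def normalize_phi(log_weights):
    """Turn unnormalized log-weights over topics into probabilities summing to 1."""
    m = max(log_weights)
    exps = [math.exp(v - m) for v in log_weights]  # shift by the max for stability
    total = sum(exps)
    return [e / total for e in exps]


phi_equal = normalize_phi([0.0, 0.0])
phi_small = normalize_phi([-1000.0, -1001.0])  # exp(-1000) alone would underflow
```

Skipping this step leaves Φ on an arbitrary scale, which then distorts every quantity computed from it.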
48
Faced Implementation Problem
49
Faced Implementation Problem
• Difficult Debugging
• Circular reference (dependence) among Φ, γ, and β
• Data: 20 Newsgroups (20News)
• Topics: 6
• Iterations: 10
50
Result: Online LDA
• Topic:1
• No.0 writes[6]: 0.007909349
• No.1 article[7]: 0.006535292
• No.2 apr[3]: 0.0034389505
• No.3 team[4]: 0.00340712
• No.4 game[4]: 0.0033219245
• No.5 year[4]: 0.0032751847
• No.6 good[4]: 0.0032546786
• No.7 time[4]: 0.0030503264
• No.8 play[4]: 0.00262638
• No.9 games[5]: 0.002433915
• No.10 season[6]: 0.0022433712
• No.11 ll[2]: 0.0020719478
• No.12 players[7]: 0.0020332362
• No.13 win[3]: 0.0019284738
• No.14 hockey[6]: 0.0018870989
51
Result: Online LDA
• No.15 league[6]: 0.0018450991
• No.16 baseball[8]: 0.0018226414
• No.17 years[5]: 0.0017960512
• No.18 mail[4]: 0.0017936684
• No.19 people[6]: 0.0017642054
• No.20 teams[5]: 0.0016675185
• No.21 great[5]: 0.001642102
• No.22 ve[2]: 0.0015846819
• No.23 point[5]: 0.0015730233
• No.24 cs[2]:0.0015609838
• No.25 didn[4]: 0.0015398773
• No.26 lot[3]: 0.0015123658
• No.27 mike[4]: 0.0014935194
• No.28 university[10]: 0.0014718652
• No.29 player[6]: 0.0014655796
Sports
• Topic:3
• No.0 writes[6]: 0.0065424195
• No.1 article[7]: 0.005621346
• No.2 apr[3]: 0.002746017
• No.3 work[4]: 0.002731466
• No.4 good[4]: 0.00266331
• No.5 ve[2]: 0.0025969497
• No.6 time[4]: 0.0025880735
• No.7 system[6]: 0.0024449623
• No.8 problem[7]: 0.002349667
• No.9 mail[4]: 0.0023234019
• No.10 windows[7]: 0.0021310966
• No.11 people[6]: 0.0018598152
• No.12 find[4]: 0.0018072439
• No.13 computer[8]: 0.0017470584
• No.14 email[5]: 0.0017204053
53
Result: Online LDA
• No.15 drive[5]: 0.0017121765
• No.16 bit[3]: 0.0016401116
• No.17 program[7]: 0.001636191
• No.18 software[8]: 0.0016341405
• No.19 university[10]: 0.0015907411
• No.20 ll[2]: 0.0015530549
• No.21 thing[5]: 0.0015159848
• No.22 card[4]: 0.0013826761
• No.23 doesn[5]: 0.0013809163
• No.24 phone[5]: 0.0013786326
• No.25 question[8]: 0.0013721529
• No.26 internet[8]:0.001368883
• No.27 file[4]: 0.0013417117
• No.28 things[6]: 0.0013097903
• No.29 set[3]: 0.0013029057
Computer
Impression about Internship
55
• Machine Learning
• Implementing ML algorithms from scratch was fun
• Contributing to OSS was a precious experience for me
Unfinished Business
56
• Documentation
• write entry for FM/Online LDA
• UDTF
• build the function into Hivemall
57
• Thank you for Listening

Treasure Data Summer Internship Final Report

  • 1. Summer Internship Final Report Naoki Ishikawa (@NeokiStones) 2015/09/30 13:30-
  • 2. Who am I 2 • Naoki Ishikawa • Waseda University, Information Science M1 • Research: Evolutional Computation/ Reinforcement Learning • Laboratory: Sugawara Lab • Laboratory theme: Artificial Intelligence
  • 3. • Implemented Algorithm • Factorization Machine • Latent Dirichlet Allocation 3 Table of contents
  • 4. • Implemented Algorithm • Factorization Machine • Latent Dirichlet Allocation 4 Table of contents
  • 5. Factorization Machine 5 • Algorithm for Recommendation • Classification(Clustering) • Regression • Supervised Learning • Need Input/Output Data • Suitable for Sparse Data
  • 7. Application 7 • Prediction of Movie Rating • Task: Prediction movie rating
 (real number) • Regression
 - Input: Self-designed Matrix 
 - Output: Rating Vector
  • 9. INPUT Details 9 • Identifier
 - User Identifier : [0, 0, …, 0, 1, 0, …,0]
 - Movie Identifier : [0, 0, …, 0, 0, 1, 0, …,0] • Designed Feature
 - Rating of Other Movie
 - Time
 - Last Movie rated
  • 10. 10 Recommendation Algorithm • Collaborative Filtering • Associations Analysis • Bayesian Network
  • 11. Prediction of Movie Rating 11 • Hivemall • Matrix Factorization • Recommendation
  • 12. 12 Difference from Matrix Factorization • Data Structure • Matrix Factorization • User-Item Matrix http://ampcamp.berkeley.edu/big-data-mini-course/img/matrix_factorization.png Input Learning Parameter
  • 13. 13 Difference from Matrix Factorization • Factorization Machine Vv k Input Learning Parameter Wk 1
  • 14. 14 • Factorization Machine • Consider • context data • Interaction between valuables Advantage of Factorization Machine
  • 15. 15 Difference from Matrix Factorization Prediction by Factorization Machine (d=2)
  • 16. 16 Difference from Matrix Factorization Prediction by Factorization Machine (d=2) (mean) Global bias Interaction Factorization (Wkj) Regression coefficience of k-th variable
  • 17. 17 Difference from Matrix Factorization Prediction by Factorization Machine (d=2) Learning Method Stochastic Gradient descent(SGD)
  • 19. 19 Difference from Matrix Factorization • d-way • FM / MF • assume K latent attributes • Matrix Factorization: d = 2 • Factorization Machine: d 2
  • 20. 20 HyperParameter • K: the number of hidden factor • η: the regulation parameter
  • 21. 21 Implemented Model • Implemented Model • d = 2 • MapModel • ArrayModel
  • 22. 22 Implemented Model • MapModel • For unknown data • Flexible • Suitable for Online Learning
  • 23. 23 Implemented Model • ArrayModel • For known data • less overhead
  • 24. 24 Other Use Case • E-Commerce User-Item Recommendation • Input Data • Age • Purchase timezone • Past bought items • Cluster ID • Target Data • Evaluation of an Item by User
  • 25. • Implemented Algorithm • Factorization Machine • Latent Dirichlet Allocation 25 Table of contents
  • 26. Latent Dirichlet Allocation 26 • Most Popular Algorithm of Topic Model • Mostly applied for text data • Find hidden structure of data • Unsupervised Learning • Need Input Data only • Generative Model
  • 27. Latent Dirichlet Allocation 27 • Generative Modelling in LDA • Mimic how to generate Document • 1. Choose what you write about • 2. Choose word from the Topic • 3. Write
  • 28. Latent Dirichlet Allocation 28 • Input • Text data (Documents) • Output • Topic-word distribution • Document-Topic distribution
  • 30. Learning Method 30 • Define Generative model • For documents • Learn parameters to reproduce the document
  • 33. Graphical Model(Code) 33 • For Topic ={1,…, K} • WordDistribution[k] Dir(β) For Document={1,…, D} TopicDistribution[d] Dir(α) For Word={1,…, numOfWord[d]} WordTopic[d][n] TopicDistribution[d] Word[d][n] WordDistribution[WordTopic[d][n]]
  • 34. Learning Method 34 • Variational Bayes • Gibbs Sampling (MCMC) • Particle Filtering
  • 35. Learning Method 35 • Variational Bayes • Gibbs Sampling (MCMC) • Particle Filtering faster than Gibbs Sampling
  • 36. Mini-batch Online LDA 36 • Faster than Batch Algorithm • Less noise than pure Online LDA Pure Online Mini-batch Online Batch Batch Size
  • 37. 37 Implemented Model • Mini-Batch Map Model • For unknown data • Don t assume Vocabulary List • Mini-Batch Array Model (Other implementation) • For known data • Assume Vocabulary List
  • 38. • Mini-Batch Map Model • For unknown data • Don t assume Vocabulary List 38 Implemented Model • Mini-Batch Array Model (Other implementation) • For known data • Assume Vocabulary List
  • 39. • Meaning Less word • LDA: Clustering word by co-occurrence • a , the , I , He , is , in , on • Stop Word: Ignore them • TF-IDF: how important a word is to a document in a collection or dataset 39 Faced Implementation Problem
  • 40. 40 Faced Implementation Problem • Meaning Less word • LDA: Clustering word by co-occurrence • a , the , I , He , is , in , on • Stop Word: Ignore them • TF-IDF: how important a word is to a document in a collection or dataset
  • 41. • TF-IDF • can be calculated by Hivemall • Input Data: (DocId, Words) • https://github.com/myui/hivemall/wiki/ TFIDF-calculation 41 Faced Implementation Problem
  • 42. • 1 ["justice:0.1641245850805637","found:0.06564983513276658","discussion: 0.06564983513276658","law:0.065 • 64983513276658","based:0.06564983513276658","religion: 0.06564983513276658","viewpoints:0.03282491756638329"," • rationality:0.03282491756638329","including:0.03282491756638329","context: 0.03282491756638329","concept:0.032 • 82491756638329","rightness:0.03282491756638329","general: 0.03282491756638329","many:0.03282491756638329","dif • fering:0.03282491756638329","fairness:0.03282491756638329","social: 0.03282491756638329","broadest:0.032824917 • 56638329 ,"equity:0.03282491756638329","includes: 0.03282491756638329","theology:0.03282491756638329"] 42 Faced Implementation Problem • TF-IDF
  • 43. Faced Implementation Problem • Vocabulary List Model • Initialize lambda for every word at the start • If a word does not appear in a document, its lambda decreases at the same rate as for other absent words • No initialization problem
  • 44. Faced Implementation Problem • Online Map Model • Initialize lambda when a new word is fetched • Final lambda depends on when the word first appeared • Initialization problem
  • 45. Faced Implementation Problem • Prepared dummy lambdas • Initialize the dummy lambdas at the start • Apply the lambda update rule to the dummy lambdas
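
One possible reading of the dummy-lambda fix is sketched below; the slides do not show the actual code, so the class `MapLambda`, its fields, and the update rule `(1 - rho) * old + rho * target` are assumptions. The idea is that a dummy value is decayed on every update exactly as an absent word would be, and a newly fetched word starts from that dummy value instead of a fresh initial value.

```python
class MapLambda:
    """Topic-word weights for one topic, stored in a map that grows on demand."""

    def __init__(self, eta=0.01, init=1.0):
        self.eta = eta
        self.seen = {}      # word -> lambda, only for words fetched so far
        self.dummy = init   # stands in for every word not fetched yet

    def get(self, word):
        # A new word starts from the dummy value, not from a fresh initial value.
        return self.seen.setdefault(word, self.dummy)

    def update(self, rho, estimates):
        """estimates: word -> mini-batch estimate for words present in the batch."""
        for word in estimates:
            self.get(word)  # register new words before updating
        for word, old in self.seen.items():
            # Words absent from the batch are pulled toward eta.
            target = estimates.get(word, self.eta)
            self.seen[word] = (1.0 - rho) * old + rho * target
        # The dummy follows the same rule as an absent word.
        self.dummy = (1.0 - rho) * self.dummy + rho * self.eta
```

With this bookkeeping, a word first fetched after several updates receives the same lambda it would have had in the vocabulary-list model, so the final value no longer depends on when the word first appeared.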
  • 48. Faced Implementation Problem • Implicit Φ normalization • Not written explicitly
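
In the standard variational derivation for LDA, Φ for a word is given only up to proportionality, Φ_k ∝ exp(E[log θ_k] + E[log β_kw]), so the division by the sum over topics is easy to miss. A minimal sketch of that step follows; the function name `normalize_phi` is hypothetical, and the per-topic log weights are assumed to be computed elsewhere.

```python
import math


def normalize_phi(log_weights):
    """Turn per-topic log weights for one word into probabilities that sum to 1."""
    # Subtract the maximum first so exp() cannot underflow for every topic at once.
    peak = max(log_weights)
    unnormalized = [math.exp(value - peak) for value in log_weights]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

Skipping this step leaves Φ unnormalized, which distorts the later γ and lambda updates without raising any error.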
  • 49. Faced Implementation Problem • Difficult debugging • Circular reference among Φ, γ, and β • [Figure: dependence between Φ, γ, and β]
  • 50. Result: Online LDA • Data: 20News (20 Newsgroups) • Topics: 6 • Iterations: 10
  • 52. Result: Online LDA • Topic 1 (labeled "Sports") • No.0 writes[6]: 0.007909349 • No.1 article[7]: 0.006535292 • No.2 apr[3]: 0.0034389505 • No.3 team[4]: 0.00340712 • No.4 game[4]: 0.0033219245 • No.5 year[4]: 0.0032751847 • No.6 good[4]: 0.0032546786 • No.7 time[4]: 0.0030503264 • No.8 play[4]: 0.00262638 • No.9 games[5]: 0.002433915 • No.10 season[6]: 0.0022433712 • No.11 ll[2]: 0.0020719478 • No.12 players[7]: 0.0020332362 • No.13 win[3]: 0.0019284738 • No.14 hockey[6]: 0.0018870989 • No.15 league[6]: 0.0018450991 • No.16 baseball[8]: 0.0018226414 • No.17 years[5]: 0.0017960512 • No.18 mail[4]: 0.0017936684 • No.19 people[6]: 0.0017642054 • No.20 teams[5]: 0.0016675185 • No.21 great[5]: 0.001642102 • No.22 ve[2]: 0.0015846819 • No.23 point[5]: 0.0015730233 • No.24 cs[2]: 0.0015609838 • No.25 didn[4]: 0.0015398773 • No.26 lot[3]: 0.0015123658 • No.27 mike[4]: 0.0014935194 • No.28 university[10]: 0.0014718652 • No.29 player[6]: 0.0014655796
  • 54. Result: Online LDA • Topic 3 (labeled "Computer") • No.0 writes[6]: 0.0065424195 • No.1 article[7]: 0.005621346 • No.2 apr[3]: 0.002746017 • No.3 work[4]: 0.002731466 • No.4 good[4]: 0.00266331 • No.5 ve[2]: 0.0025969497 • No.6 time[4]: 0.0025880735 • No.7 system[6]: 0.0024449623 • No.8 problem[7]: 0.002349667 • No.9 mail[4]: 0.0023234019 • No.10 windows[7]: 0.0021310966 • No.11 people[6]: 0.0018598152 • No.12 find[4]: 0.0018072439 • No.13 computer[8]: 0.0017470584 • No.14 email[5]: 0.0017204053 • No.15 drive[5]: 0.0017121765 • No.16 bit[3]: 0.0016401116 • No.17 program[7]: 0.001636191 • No.18 software[8]: 0.0016341405 • No.19 university[10]: 0.0015907411 • No.20 ll[2]: 0.0015530549 • No.21 thing[5]: 0.0015159848 • No.22 card[4]: 0.0013826761 • No.23 doesn[5]: 0.0013809163 • No.24 phone[5]: 0.0013786326 • No.25 question[8]: 0.0013721529 • No.26 internet[8]: 0.001368883 • No.27 file[4]: 0.0013417117 • No.28 things[6]: 0.0013097903 • No.29 set[3]: 0.0013029057
  • 55. Impressions of the Internship • Machine Learning • Implementing ML algorithms from scratch was fun • Contributing to OSS was a precious experience for me
  • 56. Unfinished Business • Documentation • Write entries for FM and Online LDA • UDTF • Build the functions into Hivemall
  • 57. Thank you for listening