Kyunghoon Kim
Mathematical approach for Text Mining
- Standard Latent Semantic Indexing -
7/17/2014 Standard Latent Semantic Indexing 1
2014. 07. 17.
UNIST Mathematical Sciences
Kyunghoon Kim ( Kyunghoon@unist.ac.kr )
What is Indexing?
Document 1: Google Glasses is a computer with a head-mounted display.
Document 2: He wore thick glasses. He worked in google corporation.
Document 3: He wore glasses to be able to read signs at a distance.

Term           Documents
google         1 2
glasses        1 2 3
is             1
a              1 3
computer       1
with           1
head-mounted   1
display        1
he             2 3
wore           2 3
thick          2
worked         2
in             2
corporation    2
to             3
be             3
able           3
read           3
…
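The table above is an inverted index: each term points to the documents that contain it. A minimal sketch of how such an index could be built in plain Python follows. The `tokenize` and `build_inverted_index` helpers are hypothetical names introduced for illustration, and lowercasing plus a simple regular expression stand in for whatever preprocessing the slide assumed.

```python
import re
from collections import defaultdict

# The three example documents from the slide, numbered from 1.
documents = {
    1: "Google Glasses is a computer with a head-mounted display.",
    2: "He wore thick glasses. He worked in google corporation.",
    3: "He wore glasses to be able to read signs at a distance.",
}

def tokenize(text):
    """Lowercase the text and split it into words, keeping hyphenated words whole."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

def build_inverted_index(docs):
    """Map each term to the sorted list of document numbers that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index(documents)
for term in ("google", "glasses", "a", "he", "wore"):
    print(term, index[term])
```

Looking up a term is then a single dictionary access, which is why an index is preferred over scanning every document for every query.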
SVD with NumPy
>>> import numpy as np
>>> Original = np.matrix([[1, 1, 0, 1], [7, 0, 0, 7], [1, 1, 0, 1], [2, 5, 3, 6]])
>>> Original
matrix([[1, 1, 0, 1],
        [7, 0, 0, 7],
        [1, 1, 0, 1],
        [2, 5, 3, 6]])
>>> U, Sigma, VT = np.linalg.svd(Original)
Singular Value Decomposition (SVD)
Harrington, Peter. Machine learning in action. Manning Publications Co., 2012.
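The figure for this slide did not survive in the transcript. As a reminder, the standard definition of the decomposition can be written as follows. The symbols $m$, $n$, $u_i$, $v_i$ and $A_k$ are notation chosen here for illustration and are not taken from the slide.

```latex
% SVD of a real m-by-n matrix A
A = U \Sigma V^{T}, \qquad
U \in \mathbb{R}^{m \times m},\;
\Sigma \in \mathbb{R}^{m \times n},\;
V \in \mathbb{R}^{n \times n}

% U and V are orthogonal; Sigma is diagonal with non-negative, sorted entries
U^{T} U = I_m, \qquad V^{T} V = I_n, \qquad
\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min(m,n)} \ge 0

% Keeping only the k largest singular values gives a rank-k approximation
A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T}
```

In NumPy terms, `np.linalg.svd` returns `U`, the vector of singular values, and `V^T` (named `VT` in these slides).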
Singular Values
>>> np.matrix(np.diag(Sigma))
matrix([[ 1.218e+01, 0.0e+00,   0.0e+00,   0.0e+00],
        [ 0.0e+00,   5.370e+00, 0.0e+00,   0.0e+00],
        [ 0.0e+00,   0.0e+00,   8.823e-01, 0.0e+00],
        [ 0.0e+00,   0.0e+00,   0.0e+00,   1.082e-15]])
Full Recovery
>>> np.matrix(U)*np.matrix(np.diag(Sigma))*np.matrix(VT)
matrix([[ 1.0e+00, 1.0e+00,   -5.296e-16, 1.0e+00],
        [ 7.0e+00, 4.302e-16,  7.979e-16, 7.0e+00],
        [ 1.0e+00, 1.0e+00,   -2.542e-17, 1.0e+00],
        [ 2.0e+00, 5.0e+00,    3.0e+00,   6.0e+00]])
The product recovers the original matrix up to floating-point round-off (entries of order 1e-16 stand for 0):
matrix([[1, 1, 0, 1],
        [7, 0, 0, 7],
        [1, 1, 0, 1],
        [2, 5, 3, 6]])
Recovering with some singular values
# Calculation with all singular values
[[1 1 0 1]
 [7 0 0 7]
 [1 1 0 1]
 [2 5 3 6]]
# Calculation with 3 of 4
[[1 1 0 1]
 [7 0 0 7]
 [1 1 0 1]
 [2 5 3 6]]
# Calculation with 2 of 4
[[1 1 0 1]
 [7 0 0 7]
 [1 1 0 1]
 [2 5 3 6]]
# Calculation with 1 of 4
[[1 0 0 1]
 [5 3 1 7]
 [1 0 0 1]
 [4 2 1 6]]
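The slide shows only the results, so here is a hedged sketch of how the four reconstructions might be computed. The helper name `rank_k_approximation` is hypothetical, and a plain `np.array` is used instead of `np.matrix` so that the `@` operator and slicing behave the same across NumPy versions.

```python
import numpy as np

# Same 4x4 example matrix as on the slides.
Original = np.array([[1, 1, 0, 1],
                     [7, 0, 0, 7],
                     [1, 1, 0, 1],
                     [2, 5, 3, 6]], dtype=float)

U, Sigma, VT = np.linalg.svd(Original)

def rank_k_approximation(U, Sigma, VT, k):
    """Rebuild the matrix from its k largest singular values only."""
    return U[:, :k] @ np.diag(Sigma[:k]) @ VT[:k, :]

for k in range(len(Sigma), 0, -1):
    approx = rank_k_approximation(U, Sigma, VT, k)
    error = np.linalg.norm(Original - approx)  # Frobenius norm of the residual
    print(f"k={k}: reconstruction error {error:.4f}")
    print(np.round(approx))
```

Because the fourth singular value is numerically zero, dropping it changes nothing beyond round-off. The error for a given k equals the square root of the sum of the squared singular values that were discarded.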
How many singular values to take
>>> sig2 = Sigma**2
>>> sig2
array([1.48e+02, 2.88e+01, 7.78e-01, 1.17e-30])
>>> sum(sig2)
178.0
>>> sum(sig2)*0.9
160.20000000000002
>>> sum(sig2[:1])
148.375554981108
>>> sum(sig2[:2])
177.22150138532837
# 177.22 > 160.2: the first two singular values carry more than 90% of the total
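The manual check above can be wrapped in a small function. This is a sketch under the assumption that the 90% threshold on the sum of squared singular values is the selection rule intended by the slide. The name `choose_rank` and the `energy` parameter are introduced here for illustration.

```python
import numpy as np

def choose_rank(singular_values, energy=0.9):
    """Smallest k whose squared singular values reach the given share of the total."""
    sig2 = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(sig2)
    threshold = energy * cumulative[-1]
    # Index of the first cumulative sum that reaches the threshold, turned into a count.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(sig2))

Original = np.array([[1, 1, 0, 1],
                     [7, 0, 0, 7],
                     [1, 1, 0, 1],
                     [2, 5, 3, 6]], dtype=float)

# compute_uv=False returns only the singular values.
Sigma = np.linalg.svd(Original, compute_uv=False)
k = choose_rank(Sigma)
print(k)
```

For the example matrix this selects two singular values, in line with the sums shown on the slide.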
Corpus
Frequency Matrix
• Each term $t_i$ generates a row vector $(a_{i1}, a_{i2}, \cdots, a_{in})$, referred to as a term vector, and each document $d_j$ generates a column vector
$$d_j = \begin{pmatrix} a_{1j} \\ \vdots \\ a_{mj} \end{pmatrix}$$
>>> A = np.matrix([[1,0,0],[0,1,0],[1,1,1],[1,1,0],[0,0,1]])
>>> A
matrix([[1, 0, 0],
        [0, 1, 0],
        [1, 1, 1],
        [1, 1, 0],
        [0, 0, 1]])
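To connect the matrix with the definition of term and document vectors, the following short sketch reads one row and one column out of the example matrix. It assumes rows index terms and columns index documents, as in the definition. The variable names are illustrative only.

```python
import numpy as np

# Example frequency matrix from the slide: 5 terms (rows) by 3 documents (columns).
A = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 1, 1],
              [1, 1, 0],
              [0, 0, 1]])

m, n = A.shape               # m terms, n documents
term_vector = A[2, :]        # row of the third term across all documents
document_vector = A[:, 0]    # column of the first document across all terms

print("terms:", m, "documents:", n)
print("term vector t_3:", term_vector)
print("document vector d_1:", document_vector)
```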
Example of SVD :: Full singular values
U, Sigma, VT = np.linalg.svd(A)
# Place the singular values in an m-by-n matrix so the shapes match U and VT
S = np.zeros((U.shape[1], VT.shape[0]))
S[:3, :3] = np.diag(Sigma)
Recon = U @ S @ VT
print(np.round(Recon))
Output:
[[ 1. 0. 0.]
 [ 0. 1. 0.]
 [ 1. 1. 1.]
 [ 1. 1. 0.]
 [ 0. 0. 1.]]
Example of SVD :: 2 singular values
U, Sigma, VT = np.linalg.svd(A)
S = np.zeros((U.shape[1], VT.shape[0]))
# Keep only the two largest singular values; the third is left at zero
S[:2, :2] = np.diag(Sigma[:2])
Recon = U @ S @ VT
print(np.round(Recon, 5))
Output:
[[ 0.5 0.5 0.]
 [ 0.5 0.5 0.]
 [ 1. 1. 1.]
 [ 1. 1. 0.]
 [ 0. 0. 1.]]
# Rounded matrix, for convenience
array([[ 0.5, 0.5, 0. ],
       [ 0.5, 0.5, 0. ],
       [ 1. , 1. , 1. ],
       [ 1. , 1. , 0. ],
       [ 0. , 0. , 1. ]])
# Unrounded matrix
matrix([[ 5.00000000e-01,  5.00000000e-01,  5.27355937e-16],
        [ 5.00000000e-01,  5.00000000e-01, -1.94289029e-16],
        [ 1.00000000e+00,  1.00000000e+00,  1.00000000e+00],
        [ 1.00000000e+00,  1.00000000e+00,  3.33066907e-16],
        [ 5.55111512e-16, -2.49800181e-16,  1.00000000e+00]])
Query
Example with Query
Case1.
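query = np.matrix([[1, 0, 0, 1, 0]])
# Cosine similarity between the query and each document column of Recon
for i in range(Recon.shape[1]):
    q = query
    d = Recon[:, i]
    # .item() replaces np.asscalar, which is no longer available in recent NumPy
    dotproduct = np.dot(q, d).item()
    normq = np.linalg.norm(q)
    normd = np.linalg.norm(d)
    print(dotproduct / (normq * normd))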
Case2.
What's the feature of LSI?
Approximation of A = matrix([[ 0.5, 0.5, 0. ],
                             [ 0.5, 0.5, 0. ],
                             [ 1. , 1. , 1. ],
                             [ 1. , 1. , 0. ],
                             [ 0. , 0. , 1. ]])
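One way to see what the approximation changes is to score the same query against both the original matrix and its rank-2 approximation. The sketch below reuses the query vector from the earlier example. The helper `cosine_scores` is a hypothetical name, and the comparison is offered as an illustration rather than as the argument made on the original slide.

```python
import numpy as np

A = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)

# Rank-2 approximation from the two largest singular values.
U, Sigma, VT = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(Sigma[:k]) @ VT[:k, :]

def cosine_scores(query, matrix):
    """Cosine similarity between the query and every column (document) of matrix."""
    dots = query @ matrix
    norms = np.linalg.norm(query) * np.linalg.norm(matrix, axis=0)
    return dots / norms

query = np.array([1, 0, 0, 1, 0], dtype=float)
scores_original = cosine_scores(query, A)
scores_rank2 = cosine_scores(query, A_k)
print("original A :", np.round(scores_original, 3))
print("rank-2 A_k :", np.round(scores_rank2, 3))
```

Against the original matrix, document 1 scores higher than document 2 because only document 1 contains the first term. Against the rank-2 approximation the first two rows are identical, so documents 1 and 2 receive the same score: the approximation treats the first two terms as interchangeable.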
Related work
Demonstration of LSI
What's Next?
• Probabilistic Latent Semantic Indexing
• Latent Dirichlet Allocation
References
• Harrington, Peter. Machine Learning in Action. Manning Publications Co., 2012.
• Simovici, Dan A. Linear Algebra Tools for Data Mining. World Scientific, 2012.
• Berry, Michael W., Susan T. Dumais, and Gavin W. O'Brien. "Using Linear Algebra for Intelligent Information Retrieval." SIAM Review 37.4 (1995): 573-595.

Mais conteúdo relacionado

Mais procurados

Large axial displacement analysis of two elements truss with effect of materi...
Large axial displacement analysis of two elements truss with effect of materi...Large axial displacement analysis of two elements truss with effect of materi...
Large axial displacement analysis of two elements truss with effect of materi...Salar Delavar Qashqai
 
Paper id 71201927
Paper id 71201927Paper id 71201927
Paper id 71201927IJRAT
 
lecture 25
lecture 25lecture 25
lecture 25sajinsc
 
Some approximation properties of modified baskakov stancu operators
Some approximation properties of modified baskakov stancu operatorsSome approximation properties of modified baskakov stancu operators
Some approximation properties of modified baskakov stancu operatorseSAT Journals
 
Fixed point theorems for random variables in complete metric spaces
Fixed point theorems for random variables in complete metric spacesFixed point theorems for random variables in complete metric spaces
Fixed point theorems for random variables in complete metric spacesAlexander Decker
 
Smooth Pinball based Quantile Neural Network
Smooth Pinball based Quantile Neural NetworkSmooth Pinball based Quantile Neural Network
Smooth Pinball based Quantile Neural NetworkKostas Hatalis, PhD
 
Multimodal Residual Learning for Visual Question-Answering
Multimodal Residual Learning for Visual Question-AnsweringMultimodal Residual Learning for Visual Question-Answering
Multimodal Residual Learning for Visual Question-AnsweringNAVER D2
 

Mais procurados (7)

Large axial displacement analysis of two elements truss with effect of materi...
Large axial displacement analysis of two elements truss with effect of materi...Large axial displacement analysis of two elements truss with effect of materi...
Large axial displacement analysis of two elements truss with effect of materi...
 
Paper id 71201927
Paper id 71201927Paper id 71201927
Paper id 71201927
 
lecture 25
lecture 25lecture 25
lecture 25
 
Some approximation properties of modified baskakov stancu operators
Some approximation properties of modified baskakov stancu operatorsSome approximation properties of modified baskakov stancu operators
Some approximation properties of modified baskakov stancu operators
 
Fixed point theorems for random variables in complete metric spaces
Fixed point theorems for random variables in complete metric spacesFixed point theorems for random variables in complete metric spaces
Fixed point theorems for random variables in complete metric spaces
 
Smooth Pinball based Quantile Neural Network
Smooth Pinball based Quantile Neural NetworkSmooth Pinball based Quantile Neural Network
Smooth Pinball based Quantile Neural Network
 
Multimodal Residual Learning for Visual Question-Answering
Multimodal Residual Learning for Visual Question-AnsweringMultimodal Residual Learning for Visual Question-Answering
Multimodal Residual Learning for Visual Question-Answering
 

Destaque

SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...Damiano Spina
 
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)rchbeir
 
AutoCardSorter - Designing the Information Architecture of a web site using L...
AutoCardSorter - Designing the Information Architecture of a web site using L...AutoCardSorter - Designing the Information Architecture of a web site using L...
AutoCardSorter - Designing the Information Architecture of a web site using L...Christos Katsanos
 
Analysis of Reviews on Sony Z3
Analysis of Reviews on Sony Z3Analysis of Reviews on Sony Z3
Analysis of Reviews on Sony Z3Krishna Bollojula
 
Recommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human CategorizationRecommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human CategorizationChristoph Trattner
 
Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)muzzy4friends
 
20 cv mil_models_for_words
20 cv mil_models_for_words20 cv mil_models_for_words
20 cv mil_models_for_wordszukun
 
Mining Features from the Object-Oriented Source Code of a Collection of Softw...
Mining Features from the Object-Oriented Source Code of a Collection of Softw...Mining Features from the Object-Oriented Source Code of a Collection of Softw...
Mining Features from the Object-Oriented Source Code of a Collection of Softw...Ra'Fat Al-Msie'deen
 
SNAPP - Learning Analytics and Knowledge Conference 2011
SNAPP - Learning Analytics and Knowledge Conference 2011SNAPP - Learning Analytics and Knowledge Conference 2011
SNAPP - Learning Analytics and Knowledge Conference 2011aneeshabakharia
 
A Semantics-based Approach to Machine Perception
A Semantics-based Approach to Machine PerceptionA Semantics-based Approach to Machine Perception
A Semantics-based Approach to Machine PerceptionCory Andrew Henson
 
Latent Semantic Transliteration using Dirichlet Mixture
Latent Semantic Transliteration using Dirichlet MixtureLatent Semantic Transliteration using Dirichlet Mixture
Latent Semantic Transliteration using Dirichlet MixtureRakuten Group, Inc.
 
BigML Summer 2016 Release
BigML Summer 2016 ReleaseBigML Summer 2016 Release
BigML Summer 2016 ReleaseBigML, Inc
 
An approach to source code plagiarism
An approach to source code plagiarismAn approach to source code plagiarism
An approach to source code plagiarismvarsha_bhat
 
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet ProcessesBayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet ProcessesJinYeong Bak
 
Blei ngjordan2003
Blei ngjordan2003Blei ngjordan2003
Blei ngjordan2003Ajay Ohri
 
How to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi
How to use Latent Semantic Analysis to Glean Real Insight - Franco AmalfiHow to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi
How to use Latent Semantic Analysis to Glean Real Insight - Franco AmalfiSocial Media Camp
 
Latent Topic-semantic Indexing based Automatic Text Summarization
Latent Topic-semantic Indexing based Automatic Text SummarizationLatent Topic-semantic Indexing based Automatic Text Summarization
Latent Topic-semantic Indexing based Automatic Text SummarizationElaheh Barati
 

Destaque (20)

SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
 
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
 
AutoCardSorter - Designing the Information Architecture of a web site using L...
AutoCardSorter - Designing the Information Architecture of a web site using L...AutoCardSorter - Designing the Information Architecture of a web site using L...
AutoCardSorter - Designing the Information Architecture of a web site using L...
 
Analysis of Reviews on Sony Z3
Analysis of Reviews on Sony Z3Analysis of Reviews on Sony Z3
Analysis of Reviews on Sony Z3
 
Recommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human CategorizationRecommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human Categorization
 
Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)
 
20 cv mil_models_for_words
20 cv mil_models_for_words20 cv mil_models_for_words
20 cv mil_models_for_words
 
Geometric Aspects of LSA
Geometric Aspects of LSAGeometric Aspects of LSA
Geometric Aspects of LSA
 
Practical Machine Learning
Practical Machine Learning Practical Machine Learning
Practical Machine Learning
 
Mining Features from the Object-Oriented Source Code of a Collection of Softw...
Mining Features from the Object-Oriented Source Code of a Collection of Softw...Mining Features from the Object-Oriented Source Code of a Collection of Softw...
Mining Features from the Object-Oriented Source Code of a Collection of Softw...
 
SNAPP - Learning Analytics and Knowledge Conference 2011
SNAPP - Learning Analytics and Knowledge Conference 2011SNAPP - Learning Analytics and Knowledge Conference 2011
SNAPP - Learning Analytics and Knowledge Conference 2011
 
A Semantics-based Approach to Machine Perception
A Semantics-based Approach to Machine PerceptionA Semantics-based Approach to Machine Perception
A Semantics-based Approach to Machine Perception
 
Latent Semantic Transliteration using Dirichlet Mixture
Latent Semantic Transliteration using Dirichlet MixtureLatent Semantic Transliteration using Dirichlet Mixture
Latent Semantic Transliteration using Dirichlet Mixture
 
BigML Summer 2016 Release
BigML Summer 2016 ReleaseBigML Summer 2016 Release
BigML Summer 2016 Release
 
An approach to source code plagiarism
An approach to source code plagiarismAn approach to source code plagiarism
An approach to source code plagiarism
 
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet ProcessesBayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
 
Blei ngjordan2003
Blei ngjordan2003Blei ngjordan2003
Blei ngjordan2003
 
How to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi
How to use Latent Semantic Analysis to Glean Real Insight - Franco AmalfiHow to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi
How to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi
 
Latent Topic-semantic Indexing based Automatic Text Summarization
Latent Topic-semantic Indexing based Automatic Text SummarizationLatent Topic-semantic Indexing based Automatic Text Summarization
Latent Topic-semantic Indexing based Automatic Text Summarization
 
Naive Bayes | Statistics
Naive Bayes | StatisticsNaive Bayes | Statistics
Naive Bayes | Statistics
 

Semelhante a Mathematical approach for Text Mining 1

Operations in Digital Image Processing + Convolution by Example
Operations in Digital Image Processing + Convolution by ExampleOperations in Digital Image Processing + Convolution by Example
Operations in Digital Image Processing + Convolution by ExampleAhmed Gad
 
Product Design Forecasting Techniquesision.ppt
Product Design Forecasting Techniquesision.pptProduct Design Forecasting Techniquesision.ppt
Product Design Forecasting Techniquesision.pptavidc1000
 
Breaking a Stick to form a Pentagon with Positive Integers using Programming ...
Breaking a Stick to form a Pentagon with Positive Integers using Programming ...Breaking a Stick to form a Pentagon with Positive Integers using Programming ...
Breaking a Stick to form a Pentagon with Positive Integers using Programming ...IRJET Journal
 
Forecasting_Quantitative Forecasting.ppt
Forecasting_Quantitative Forecasting.pptForecasting_Quantitative Forecasting.ppt
Forecasting_Quantitative Forecasting.pptRituparnaDas584083
 
第5回 様々なファイル形式の読み込みとデータの書き出し(解答付き)
第5回 様々なファイル形式の読み込みとデータの書き出し(解答付き)第5回 様々なファイル形式の読み込みとデータの書き出し(解答付き)
第5回 様々なファイル形式の読み込みとデータの書き出し(解答付き)Wataru Shito
 
Algorithms lecture 3
Algorithms lecture 3Algorithms lecture 3
Algorithms lecture 3Mimi Haque
 
Sentimental analysis of financial articles using neural network
Sentimental analysis of financial articles using neural networkSentimental analysis of financial articles using neural network
Sentimental analysis of financial articles using neural networkBhavyateja Potineni
 
CS-102 DS-class_01_02 Lectures Data .pdf
CS-102 DS-class_01_02 Lectures Data .pdfCS-102 DS-class_01_02 Lectures Data .pdf
CS-102 DS-class_01_02 Lectures Data .pdfssuser034ce1
 
Forecasting
ForecastingForecasting
ForecastingSVGANGAD
 
Time and cost optimization of business process RMA using PERT and goal progra...
Time and cost optimization of business process RMA using PERT and goal progra...Time and cost optimization of business process RMA using PERT and goal progra...
Time and cost optimization of business process RMA using PERT and goal progra...TELKOMNIKA JOURNAL
 
Introducing R package ESG at Rmetrics Paris 2014 conference
Introducing R package ESG at Rmetrics Paris 2014 conferenceIntroducing R package ESG at Rmetrics Paris 2014 conference
Introducing R package ESG at Rmetrics Paris 2014 conferenceThierry Moudiki
 
Optimization of a 2D localization system of a moving object based on the prop...
Optimization of a 2D localization system of a moving object based on the prop...Optimization of a 2D localization system of a moving object based on the prop...
Optimization of a 2D localization system of a moving object based on the prop...IRJET Journal
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with RYanchang Zhao
 
第5回 様々なファイル形式の読み込みとデータの書き出し
第5回 様々なファイル形式の読み込みとデータの書き出し第5回 様々なファイル形式の読み込みとデータの書き出し
第5回 様々なファイル形式の読み込みとデータの書き出しWataru Shito
 
Advanced Econometrics L7-8.pptx
Advanced Econometrics L7-8.pptxAdvanced Econometrics L7-8.pptx
Advanced Econometrics L7-8.pptxakashayosha
 
(KO) 온라인 뉴스 댓글 플랫폼을 흐리는 어뷰저 분석기 / (EN) Online ...
(KO) 온라인 뉴스 댓글 플랫폼을 흐리는 어뷰저 분석기 / (EN) Online ...(KO) 온라인 뉴스 댓글 플랫폼을 흐리는 어뷰저 분석기 / (EN) Online ...
(KO) 온라인 뉴스 댓글 플랫폼을 흐리는 어뷰저 분석기 / (EN) Online ...Ji Hyung Moon
 
Approximate Dynamic Programming: A New Paradigm for Process Control & Optimiz...
Approximate Dynamic Programming: A New Paradigm for Process Control & Optimiz...Approximate Dynamic Programming: A New Paradigm for Process Control & Optimiz...
Approximate Dynamic Programming: A New Paradigm for Process Control & Optimiz...height
 
Stock price prediction using k* nearest neighbors and indexing dynamic time w...
Stock price prediction using k* nearest neighbors and indexing dynamic time w...Stock price prediction using k* nearest neighbors and indexing dynamic time w...
Stock price prediction using k* nearest neighbors and indexing dynamic time w...Kei Nakagawa
 

Semelhante a Mathematical approach for Text Mining 1 (20)

Operations in Digital Image Processing + Convolution by Example
Operations in Digital Image Processing + Convolution by ExampleOperations in Digital Image Processing + Convolution by Example
Operations in Digital Image Processing + Convolution by Example
 
Product Design Forecasting Techniquesision.ppt
Product Design Forecasting Techniquesision.pptProduct Design Forecasting Techniquesision.ppt
Product Design Forecasting Techniquesision.ppt
 
Breaking a Stick to form a Pentagon with Positive Integers using Programming ...
Breaking a Stick to form a Pentagon with Positive Integers using Programming ...Breaking a Stick to form a Pentagon with Positive Integers using Programming ...
Breaking a Stick to form a Pentagon with Positive Integers using Programming ...
 
Forecasting_Quantitative Forecasting.ppt
Forecasting_Quantitative Forecasting.pptForecasting_Quantitative Forecasting.ppt
Forecasting_Quantitative Forecasting.ppt
 
Complexity analysis
Complexity analysisComplexity analysis
Complexity analysis
 
第5回 様々なファイル形式の読み込みとデータの書き出し(解答付き)
第5回 様々なファイル形式の読み込みとデータの書き出し(解答付き)第5回 様々なファイル形式の読み込みとデータの書き出し(解答付き)
第5回 様々なファイル形式の読み込みとデータの書き出し(解答付き)
 
Algorithms lecture 3
Algorithms lecture 3Algorithms lecture 3
Algorithms lecture 3
 
Forecasting.ppt
Forecasting.pptForecasting.ppt
Forecasting.ppt
 
Sentimental analysis of financial articles using neural network
Sentimental analysis of financial articles using neural networkSentimental analysis of financial articles using neural network
Sentimental analysis of financial articles using neural network
 
CS-102 DS-class_01_02 Lectures Data .pdf
CS-102 DS-class_01_02 Lectures Data .pdfCS-102 DS-class_01_02 Lectures Data .pdf
CS-102 DS-class_01_02 Lectures Data .pdf
 
Forecasting
ForecastingForecasting
Forecasting
 
Time and cost optimization of business process RMA using PERT and goal progra...
Time and cost optimization of business process RMA using PERT and goal progra...Time and cost optimization of business process RMA using PERT and goal progra...
Time and cost optimization of business process RMA using PERT and goal progra...
 
Introducing R package ESG at Rmetrics Paris 2014 conference
Introducing R package ESG at Rmetrics Paris 2014 conferenceIntroducing R package ESG at Rmetrics Paris 2014 conference
Introducing R package ESG at Rmetrics Paris 2014 conference
 
Optimization of a 2D localization system of a moving object based on the prop...
Optimization of a 2D localization system of a moving object based on the prop...Optimization of a 2D localization system of a moving object based on the prop...
Optimization of a 2D localization system of a moving object based on the prop...
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with R
 
第5回 様々なファイル形式の読み込みとデータの書き出し
第5回 様々なファイル形式の読み込みとデータの書き出し第5回 様々なファイル形式の読み込みとデータの書き出し
第5回 様々なファイル形式の読み込みとデータの書き出し
 
Advanced Econometrics L7-8.pptx
Advanced Econometrics L7-8.pptxAdvanced Econometrics L7-8.pptx
Advanced Econometrics L7-8.pptx
 
(KO) 온라인 뉴스 댓글 플랫폼을 흐리는 어뷰저 분석기 / (EN) Online ...
(KO) 온라인 뉴스 댓글 플랫폼을 흐리는 어뷰저 분석기 / (EN) Online ...(KO) 온라인 뉴스 댓글 플랫폼을 흐리는 어뷰저 분석기 / (EN) Online ...
(KO) 온라인 뉴스 댓글 플랫폼을 흐리는 어뷰저 분석기 / (EN) Online ...
 
Approximate Dynamic Programming: A New Paradigm for Process Control & Optimiz...
Approximate Dynamic Programming: A New Paradigm for Process Control & Optimiz...Approximate Dynamic Programming: A New Paradigm for Process Control & Optimiz...
Approximate Dynamic Programming: A New Paradigm for Process Control & Optimiz...
 
Stock price prediction using k* nearest neighbors and indexing dynamic time w...
Stock price prediction using k* nearest neighbors and indexing dynamic time w...Stock price prediction using k* nearest neighbors and indexing dynamic time w...
Stock price prediction using k* nearest neighbors and indexing dynamic time w...
 

Mais de Kyunghoon Kim

넥스트 노멀 - 인간과 AI의 협업
넥스트 노멀 - 인간과 AI의 협업넥스트 노멀 - 인간과 AI의 협업
넥스트 노멀 - 인간과 AI의 협업Kyunghoon Kim
 
토론하는 AI 김컴재와 AI 조향사 센트리아
토론하는 AI 김컴재와 AI 조향사 센트리아토론하는 AI 김컴재와 AI 조향사 센트리아
토론하는 AI 김컴재와 AI 조향사 센트리아Kyunghoon Kim
 
빅데이터의 다음 단계는 예측 분석이다
빅데이터의 다음 단계는 예측 분석이다빅데이터의 다음 단계는 예측 분석이다
빅데이터의 다음 단계는 예측 분석이다Kyunghoon Kim
 
중학생을 위한 4차 산업혁명 시대의 인공지능 이야기
중학생을 위한 4차 산업혁명 시대의 인공지능 이야기중학생을 위한 4차 산업혁명 시대의 인공지능 이야기
중학생을 위한 4차 산업혁명 시대의 인공지능 이야기Kyunghoon Kim
 
4차 산업혁명 시대의 진로와 진학
4차 산업혁명 시대의 진로와 진학4차 산업혁명 시대의 진로와 진학
4차 산업혁명 시대의 진로와 진학Kyunghoon Kim
 
20200620 신호와 소음 독서토론
20200620 신호와 소음 독서토론20200620 신호와 소음 독서토론
20200620 신호와 소음 독서토론Kyunghoon Kim
 
중학생을 위한 인공지능 이야기
중학생을 위한 인공지능 이야기중학생을 위한 인공지능 이야기
중학생을 위한 인공지능 이야기Kyunghoon Kim
 
슬쩍 해보는 선형대수학
슬쩍 해보는 선형대수학슬쩍 해보는 선형대수학
슬쩍 해보는 선형대수학Kyunghoon Kim
 
파이썬으로 해보는 이미지 처리
파이썬으로 해보는 이미지 처리파이썬으로 해보는 이미지 처리
파이썬으로 해보는 이미지 처리Kyunghoon Kim
 
기계가 선형대수학을 통해 한국어를 이해하는 방법
기계가 선형대수학을 통해 한국어를 이해하는 방법기계가 선형대수학을 통해 한국어를 이해하는 방법
기계가 선형대수학을 통해 한국어를 이해하는 방법Kyunghoon Kim
 
공공데이터 활용사례
공공데이터 활용사례공공데이터 활용사례
공공데이터 활용사례Kyunghoon Kim
 
기계학습, 딥러닝, 인공지능 사이의 차이점 이해하기
기계학습, 딥러닝, 인공지능 사이의 차이점 이해하기기계학습, 딥러닝, 인공지능 사이의 차이점 이해하기
기계학습, 딥러닝, 인공지능 사이의 차이점 이해하기Kyunghoon Kim
 
2018 인공지능에 대하여
2018 인공지능에 대하여2018 인공지능에 대하여
2018 인공지능에 대하여Kyunghoon Kim
 
Naive bayes Classification using Python3
Naive bayes Classification using Python3Naive bayes Classification using Python3
Naive bayes Classification using Python3Kyunghoon Kim
 
Basic statistics using Python3
Basic statistics using Python3Basic statistics using Python3
Basic statistics using Python3Kyunghoon Kim
 
[20160813, PyCon2016APAC] 뉴스를 재미있게 만드는 방법; 뉴스잼
[20160813, PyCon2016APAC] 뉴스를 재미있게 만드는 방법; 뉴스잼[20160813, PyCon2016APAC] 뉴스를 재미있게 만드는 방법; 뉴스잼
[20160813, PyCon2016APAC] 뉴스를 재미있게 만드는 방법; 뉴스잼Kyunghoon Kim
 
사회 연결망의 링크 예측
사회 연결망의 링크 예측사회 연결망의 링크 예측
사회 연결망의 링크 예측Kyunghoon Kim
 

Mais de Kyunghoon Kim (20)

넥스트 노멀 - 인간과 AI의 협업
넥스트 노멀 - 인간과 AI의 협업넥스트 노멀 - 인간과 AI의 협업
넥스트 노멀 - 인간과 AI의 협업
 
토론하는 AI 김컴재와 AI 조향사 센트리아
토론하는 AI 김컴재와 AI 조향사 센트리아토론하는 AI 김컴재와 AI 조향사 센트리아
토론하는 AI 김컴재와 AI 조향사 센트리아
 
빅데이터의 다음 단계는 예측 분석이다
빅데이터의 다음 단계는 예측 분석이다빅데이터의 다음 단계는 예측 분석이다
빅데이터의 다음 단계는 예측 분석이다
 
중학생을 위한 4차 산업혁명 시대의 인공지능 이야기
중학생을 위한 4차 산업혁명 시대의 인공지능 이야기중학생을 위한 4차 산업혁명 시대의 인공지능 이야기
중학생을 위한 4차 산업혁명 시대의 인공지능 이야기
 
업무 자동화
업무 자동화업무 자동화
업무 자동화
 
4차 산업혁명 시대의 진로와 진학
4차 산업혁명 시대의 진로와 진학4차 산업혁명 시대의 진로와 진학
4차 산업혁명 시대의 진로와 진학
 
20200620 신호와 소음 독서토론
20200620 신호와 소음 독서토론20200620 신호와 소음 독서토론
20200620 신호와 소음 독서토론
 
중학생을 위한 인공지능 이야기
중학생을 위한 인공지능 이야기중학생을 위한 인공지능 이야기
중학생을 위한 인공지능 이야기
 
슬쩍 해보는 선형대수학
슬쩍 해보는 선형대수학슬쩍 해보는 선형대수학
슬쩍 해보는 선형대수학
 
파이썬으로 해보는 이미지 처리
파이썬으로 해보는 이미지 처리파이썬으로 해보는 이미지 처리
파이썬으로 해보는 이미지 처리
 
기계가 선형대수학을 통해 한국어를 이해하는 방법
기계가 선형대수학을 통해 한국어를 이해하는 방법기계가 선형대수학을 통해 한국어를 이해하는 방법
기계가 선형대수학을 통해 한국어를 이해하는 방법
 
공공데이터 활용사례
공공데이터 활용사례공공데이터 활용사례
공공데이터 활용사례
 
기계학습, 딥러닝, 인공지능 사이의 차이점 이해하기
기계학습, 딥러닝, 인공지능 사이의 차이점 이해하기기계학습, 딥러닝, 인공지능 사이의 차이점 이해하기
기계학습, 딥러닝, 인공지능 사이의 차이점 이해하기
 
Korean Text mining
Korean Text miningKorean Text mining
Korean Text mining
 
2018 인공지능에 대하여
2018 인공지능에 대하여2018 인공지능에 대하여
2018 인공지능에 대하여
 
Naive bayes Classification using Python3
Naive bayes Classification using Python3Naive bayes Classification using Python3
Naive bayes Classification using Python3
 
Basic statistics using Python3
Basic statistics using Python3Basic statistics using Python3
Basic statistics using Python3
 
[20160813, PyCon2016APAC] 뉴스를 재미있게 만드는 방법; 뉴스잼
[20160813, PyCon2016APAC] 뉴스를 재미있게 만드는 방법; 뉴스잼[20160813, PyCon2016APAC] 뉴스를 재미있게 만드는 방법; 뉴스잼
[20160813, PyCon2016APAC] 뉴스를 재미있게 만드는 방법; 뉴스잼
 
Topic Modeling
Topic ModelingTopic Modeling
Topic Modeling
 
사회 연결망의 링크 예측
사회 연결망의 링크 예측사회 연결망의 링크 예측
사회 연결망의 링크 예측
 

Mathematical approach for Text Mining 1

  • 1. Kyunghoon Kim Mathematical approach for Text Mining - Standard Latent Semantic Indexing - 7/17/2014 Standard Latent Semantic Indexing 1 2014. 07. 17. UNIST Mathematical Sciences Kyunghoon Kim ( Kyunghoon@unist.ac.kr )
  • 2. Kyunghoon Kim What is the Indexing? 7/17/2014 Standard Latent Semantic Indexing 2 Google Glasses is a computer with a head-mounted display. He wore thick glasses. He worked in google corporation. He wore glasses to be able to read signs at a distance. google glasses is a computer with head-mounted display he 1 2 1 2 3 1 1 3 1 1 1 1 2 3 wore thick worked in corporation to be able read … 2 3 2 2 2 2 3 3 3 3 1 2 3
  • 3. Kyunghoon Kim >>> Original matrix([[1, 1, 0, 1], [7, 0, 0, 7], [1, 1, 0, 1], [2, 5, 3, 6]]) >>> U, Sigma, VT = np.linalg.svd(Original) SVD with Numpy 7/17/2014 Standard Latent Semantic Indexing 3
  • 4. Kyunghoon Kim Singular Value Decomposition(SVD) 7/17/2014 Standard Latent Semantic Indexing 4 Harrington, Peter. Machine learning in action. Manning Publications Co., 2012.
  • 5. Kyunghoon Kim >>> np.matrix(np.diag(Sigma)) matrix([ [ 1.218e+01, 0.0e+00, 0.0e+00, 0.0e+00], [ 0.0e+00, 5.370e+00, 0.0e+00, 0.0e+00], [ 0.0e+00, 0.0e+00, 8.823e-01, 0.0e+00], [ 0.0e+00, 0.0e+00, 0.0e+00, 1.082e-15]]) Singular Values 7/17/2014 Standard Latent Semantic Indexing 5
  • 6. Kyunghoon Kim np.matrix(U)*np.matrix(np.diag(Sigma))*np.matrix(VT) matrix([ [ 1.0e+00, 1.0e+00, -5.296e-16, 1.0e+00], [ 7.0e+00, 4.302e-16, 7.979e-16, 7.0e+00], [ 1.0e+00, 1.0e+00, -2.542e-17, 1.0e+00], [ 2.0e+00, 5.0e+00, 3.0e+00, 6.0e+00]]) Full Recovery 7/17/2014 Standard Latent Semantic Indexing 6 matrix([[1, 1, 0, 1], [7, 0, 0, 7], [1, 1, 0, 1], [2, 5, 3, 6]])
  • 7. Kyunghoon Kim # Calculation with all singular value [[1 1 0 1] [7 0 0 7] [1 1 0 1] [2 5 3 6]] # Calculation with 3 of 4 [[1 1 0 1] [7 0 0 7] [1 1 0 1] [2 5 3 6]] Recovering with some singular values 7/17/2014 Standard Latent Semantic Indexing 7 # Calculation with 2 of 4 [[1 1 0 1] [7 0 0 7] [1 1 0 1] [2 5 3 6]] # Calculation with 1 of 4 [[1 0 0 1] [5 3 1 7] [1 0 0 1] [4 2 1 6]]
  • 8. Kyunghoon Kim >>> sig2=Sigma**2 array([1.48e+02, 2.88e+01, 7.78e-01, 1.17e-30]) >>> sum(sig2) 178.0 >>> sum(sig2)*0.9 160.20000000000002 >>> sum(sig2[:1]) 148.375554981108 How many take singular values 7/17/2014 Standard Latent Semantic Indexing 8 >>> sum(sig2[:2]) 177.22150138532837
  • 9. Kyunghoon Kim Corpus 7/17/2014 Standard Latent Semantic Indexing 9
  • 10. Kyunghoon Kim Corpus 7/17/2014 Standard Latent Semantic Indexing 10
  • 11. Kyunghoon Kim Frequency Matrix 7/17/2014 Standard Latent Semantic Indexing 11
  • 12. Kyunghoon Kim • Each term 𝑡𝑡𝑖𝑖 generates a row vector (𝑎𝑎𝑖𝑖 𝑖, 𝑎𝑎𝑖𝑖 𝑖, ⋯ , 𝑎𝑎𝑖𝑖 𝑖𝑖) referred to as a term vector and each document 𝑑𝑑𝑗𝑗 generates a column vector 𝑑𝑑𝑗𝑗 = 𝑎𝑎1𝑗𝑗 ⋮ 𝑎𝑎 𝑚𝑚𝑚𝑚 Frequency Matrix 7/17/2014 Standard Latent Semantic Indexing 12
  • 13. Frequency Matrix
    >>> A = np.matrix([[1,0,0],[0,1,0],[1,1,1],[1,1,0],[0,0,1]])
    >>> A
    matrix([[1, 0, 0],
            [0, 1, 0],
            [1, 1, 1],
            [1, 1, 0],
            [0, 0, 1]])
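To make the term and document vectors concrete, here is a minimal sketch that builds a frequency matrix from raw text. The three-document corpus is hypothetical (it is not the corpus shown on the slides), and whitespace splitting stands in for real tokenization.

```python
import numpy as np

# Hypothetical three-document corpus, not the one used in the slides.
docs = ["latent semantic indexing",
        "semantic indexing of text",
        "text mining"]

# Vocabulary: sorted unique terms, one matrix row per term.
terms = sorted({word for doc in docs for word in doc.split()})

# A[i, j] = number of times term i occurs in document j.
A = np.zeros((len(terms), len(docs)), dtype=int)
for j, doc in enumerate(docs):
    for word in doc.split():
        A[terms.index(word), j] += 1

term_vector = A[terms.index("semantic"), :]  # row vector for one term
doc_vector = A[:, 0]                         # column vector for the first document
print(terms)
print(A)
```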
  • 14. Example of SVD :: Full singular values
    U, Sigma, VT = np.linalg.svd(A)
    S = np.zeros((U.shape[1], VT.shape[0]))  # 5x3 matrix of zeros
    S[:3, :3] = np.diag(Sigma)               # singular values on the diagonal
    Recon = U*S*VT                           # matrix product, since A is an np.matrix
    print(np.round(Recon))
    [[ 1. 0. 0.]
     [ 0. 1. 0.]
     [ 1. 1. 1.]
     [ 1. 1. 0.]
     [ 0. 0. 1.]]
  • 15. Singular Value Decomposition (SVD). Source: Harrington, Peter. Machine Learning in Action. Manning Publications Co., 2012.
  • 16. Example of SVD :: 2 singular values
    U, Sigma, VT = np.linalg.svd(A)
    S = np.zeros((U.shape[1], VT.shape[0]))  # 5x3 matrix of zeros
    S[:2, :2] = np.diag(Sigma[:2])           # keep only the two largest singular values
    Recon = U*S*VT                           # matrix product, since A is an np.matrix
    print(np.round(Recon, 5))
    [[ 0.5 0.5 0.]
     [ 0.5 0.5 0.]
     [ 1. 1. 1.]
     [ 1. 1. 0.]
     [ 0. 0. 1.]]
  • 17. Example of SVD :: 2 singular values
    % rounded matrix, for convenience
    array([[ 0.5, 0.5, 0. ],
           [ 0.5, 0.5, 0. ],
           [ 1. , 1. , 1. ],
           [ 1. , 1. , 0. ],
           [ 0. , 0. , 1. ]])
    % unrounded matrix
    matrix([[ 5.00000000e-01,  5.00000000e-01,  5.27355937e-16],
            [ 5.00000000e-01,  5.00000000e-01, -1.94289029e-16],
            [ 1.00000000e+00,  1.00000000e+00,  1.00000000e+00],
            [ 1.00000000e+00,  1.00000000e+00,  3.33066907e-16],
            [ 5.55111512e-16, -2.49800181e-16,  1.00000000e+00]])
  • 18. Query
  • 19. Query
  • 20. Example with Query (Case 1, Case 2)
    matrix([[ 5.00000000e-01,  5.00000000e-01,  5.27355937e-16],
            [ 5.00000000e-01,  5.00000000e-01, -1.94289029e-16],
            [ 1.00000000e+00,  1.00000000e+00,  1.00000000e+00],
            [ 1.00000000e+00,  1.00000000e+00,  3.33066907e-16],
            [ 5.55111512e-16, -2.49800181e-16,  1.00000000e+00]])
  • 21. Example with Query: Case 1
    query = np.matrix([[1,0,0,1,0]])
    for i in range(Recon.shape[1]):
        q = query
        d = Recon[:, i]                      # i-th document (column) of the reconstruction
        dotproduct = np.dot(q, d).item()     # 1x1 result as a Python scalar
        normq = np.linalg.norm(q)
        normd = np.linalg.norm(d)
        print(dotproduct / (normq*normd))    # cosine similarity of query and document
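The per-document loop above computes a cosine similarity between the query and each column. A vectorized sketch of the same calculation follows; it assumes plain arrays, uses the rounded rank-2 reconstruction from slide 17 as input, and the helper name `cosine_scores` is illustrative.

```python
import numpy as np

def cosine_scores(query, docs):
    """Cosine similarity between `query` and every column of `docs`."""
    query = np.asarray(query, dtype=float)
    docs = np.asarray(docs, dtype=float)
    # Dot product with each column, divided by the product of the norms.
    return (query @ docs) / (np.linalg.norm(query) * np.linalg.norm(docs, axis=0))

# Rounded rank-2 reconstruction from the slides.
recon = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.5, 0.0],
                  [1.0, 1.0, 1.0],
                  [1.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

query = [1, 0, 0, 1, 0]
scores = cosine_scores(query, recon)
print(scores)
```

Computing all scores in one expression avoids the `np.matrix` indexing the loop relies on, at the cost of assuming no document column is all zeros.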
  • 22. Example with Query (Case 1, Case 2)
    matrix([[ 5.00000000e-01,  5.00000000e-01,  5.27355937e-16],
            [ 5.00000000e-01,  5.00000000e-01, -1.94289029e-16],
            [ 1.00000000e+00,  1.00000000e+00,  1.00000000e+00],
            [ 1.00000000e+00,  1.00000000e+00,  3.33066907e-16],
            [ 5.55111512e-16, -2.49800181e-16,  1.00000000e+00]])
  • 23. What’s the feature of LSI?
    Approximation of A =
    matrix([[ 0.5, 0.5, 0. ],
            [ 0.5, 0.5, 0. ],
            [ 1. , 1. , 1. ],
            [ 1. , 1. , 0. ],
            [ 0. , 0. , 1. ]])
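One way to read the approximation above is to score the Case 1 query against both the original frequency matrix and its rank-2 approximation. The sketch below does that comparison; the interpretation (that the first two documents become indistinguishable to this query after truncation) is an observation about this example, not a claim taken from the slides, and `cosine_scores` is a hypothetical helper.

```python
import numpy as np

def cosine_scores(query, docs):
    """Cosine similarity between `query` and every column of `docs`."""
    return (query @ docs) / (np.linalg.norm(query) * np.linalg.norm(docs, axis=0))

# Frequency matrix A from slide 13 (rows are terms, columns are documents).
A = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)

# Rank-2 approximation: keep the two largest singular values.
U, sigma, VT = np.linalg.svd(A, full_matrices=False)
A2 = U[:, :2] @ np.diag(sigma[:2]) @ VT[:2, :]

query = np.array([1, 0, 0, 1, 0], dtype=float)
raw = cosine_scores(query, A)    # scores against the original matrix
lsi = cosine_scores(query, A2)   # scores against the rank-2 approximation
print(np.round(raw, 4))
print(np.round(lsi, 4))
```

In the original matrix the second document does not contain the first query term and scores lower than the first document; after truncation the two documents share the same column and therefore the same score.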
  • 24. Related work
  • 25. Demonstration of LSI
  • 29. What’s Next?
    • Probabilistic Latent Semantic Indexing
    • Latent Dirichlet Allocation
  • 30. References
    • Harrington, Peter. Machine Learning in Action. Manning Publications Co., 2012.
    • Simovici, Dan A. Linear Algebra Tools for Data Mining. World Scientific, 2012.
    • Berry, Michael W., Susan T. Dumais, and Gavin W. O'Brien. "Using linear algebra for intelligent information retrieval." SIAM Review 37.4 (1995): 573-595.