1. Kyunghoon Kim
Mathematical approach for Text Mining
- Standard Latent Semantic Indexing -
7/17/2014 Standard Latent Semantic Indexing 1
2014. 07. 17.
UNIST Mathematical Sciences
Kyunghoon Kim ( Kyunghoon@unist.ac.kr )
2. Kyunghoon Kim
What Is Indexing?
Document 1: Google Glasses is a computer with a head-mounted display.
Document 2: He wore thick glasses. He worked in google corporation.
Document 3: He wore glasses to be able to read signs at a distance.

Term            Documents
google          1, 2
glasses         1, 2, 3
is              1
a               1, 3
computer        1
with            1
head-mounted    1
display         1
he              2, 3
wore            2, 3
thick           2
worked          2
in              2
corporation     2
to              3
be              3
able            3
read            3
…               …
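The term-to-document lists above form an inverted index. A minimal sketch of how such an index could be built from the three example sentences follows. The tokenization rule (lowercase runs of letters, with hyphenated words kept whole) and the function name `build_inverted_index` are assumptions made for illustration, not something the slide specifies.

```python
import re
from collections import defaultdict

# The three example documents from the slide, keyed by document number.
documents = {
    1: "Google Glasses is a computer with a head-mounted display.",
    2: "He wore thick glasses. He worked in google corporation.",
    3: "He wore glasses to be able to read signs at a distance.",
}

def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: a term is a run of letters, optionally hyphenated.
        for term in re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_inverted_index(documents)
for term in ("google", "glasses", "a", "he", "read"):
    print(term, index[term])
```

Looking up a term in `index` returns the documents that contain it, which is the lookup the slide illustrates.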
3. Kyunghoon Kim
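>>> import numpy as np
>>> Original = np.matrix([[1, 1, 0, 1], [7, 0, 0, 7], [1, 1, 0, 1], [2, 5, 3, 6]])
>>> Original
matrix([[1, 1, 0, 1],
        [7, 0, 0, 7],
        [1, 1, 0, 1],
        [2, 5, 3, 6]])
>>> U, Sigma, VT = np.linalg.svd(Original)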
SVD with NumPy
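To make the three return values of `np.linalg.svd` concrete, the sketch below inspects their shapes for the same 4x4 matrix and checks that the factors multiply back to it. It uses a plain `ndarray` and the `@` operator rather than `np.matrix`, which is a choice made here and not part of the slide.

```python
import numpy as np

original = np.array([[1, 1, 0, 1],
                     [7, 0, 0, 7],
                     [1, 1, 0, 1],
                     [2, 5, 3, 6]], dtype=float)

U, sigma, VT = np.linalg.svd(original)

# sigma is a 1-D array of singular values in descending order, not a matrix.
print(U.shape, sigma.shape, VT.shape)
print(np.round(sigma, 4))

# Rows 1 and 3 are identical, so the smallest singular value is numerically zero.
rank = np.linalg.matrix_rank(original)
print(rank)

# Rebuild the matrix from its factors.
reconstructed = U @ np.diag(sigma) @ VT
print(np.allclose(reconstructed, original))
```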
4. Kyunghoon Kim
Singular Value Decomposition (SVD)
Harrington, Peter. Machine learning in action. Manning Publications Co., 2012.
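The figure for this slide did not survive extraction. As a stand-in, the standard statement of the decomposition for an m x n real matrix A is given below. The symbols follow the NumPy names used elsewhere in the deck (U, Sigma, VT), and the dimensions are those of the full (not reduced) SVD.

```latex
A = U \Sigma V^{T}, \qquad
U \in \mathbb{R}^{m \times m}, \quad
\Sigma \in \mathbb{R}^{m \times n}, \quad
V \in \mathbb{R}^{n \times n},
\qquad
U^{T} U = I_m, \quad V^{T} V = I_n, \quad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_{\min(m,n)}), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge 0 .
```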
12. Kyunghoon Kim
• Each term $t_i$ generates a row vector $(a_{i1}, a_{i2}, \dots, a_{in})$,
referred to as a term vector, and each document $d_j$ generates a
column vector
$$d_j = \begin{pmatrix} a_{1j} \\ \vdots \\ a_{mj} \end{pmatrix}$$
Frequency Matrix
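As an illustration of the definition above, the sketch below counts term occurrences to build a small term-by-document frequency matrix, then reads off one term vector (a row) and one document vector (a column). The tokenized documents are hypothetical and are not taken from the slides.

```python
import numpy as np

# Hypothetical, already-tokenized documents (not taken from the slides).
docs = [
    ["google", "glasses", "computer"],
    ["glasses", "google", "glasses"],
    ["glasses", "signs"],
]

# Fix an ordering of the vocabulary: row i corresponds to terms[i].
terms = sorted({t for doc in docs for t in doc})
row = {t: i for i, t in enumerate(terms)}

# a[i, j] = number of times term i occurs in document j.
a = np.zeros((len(terms), len(docs)), dtype=int)
for j, doc in enumerate(docs):
    for t in doc:
        a[row[t], j] += 1

print(terms)
print(a)
print(a[row["glasses"], :])  # term vector for "glasses"
print(a[:, 1])               # document vector for the second document
```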
13. Kyunghoon Kim
Frequency Matrix
>>> A = np.matrix([[1,0,0],[0,1,0],[1,1,1],[1,1,0],[0,0,1]])
>>> A
matrix([[1, 0, 0],
        [0, 1, 0],
        [1, 1, 1],
        [1, 1, 0],
        [0, 0, 1]])
14. Kyunghoon Kim
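# Full SVD of the 5x3 matrix A: U is 5x5, Sigma holds 3 singular values, VT is 3x3
U, Sigma, VT = np.linalg.svd(A)
# Place the singular values on the diagonal of a 5x3 matrix
S = np.zeros((U.shape[1], VT.shape[0]))
S[:3, :3] = np.diag(Sigma)
# A is an np.matrix, so U and VT are too and * is a matrix product
Recon = U * S * VT
print(np.round(Recon))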
Example of SVD :: Full singular values
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 1. 1. 1.]
[ 1. 1. 0.]
[ 0. 0. 1.]]
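The zero-padded 5x3 matrix `S` is only needed because the full SVD returns a 5x5 `U`. One alternative, shown here as a sketch rather than as the author's method, is the reduced SVD: passing `full_matrices=False` to `np.linalg.svd` gives a 5x3 `U`, so the singular values fit in an ordinary 3x3 diagonal matrix.

```python
import numpy as np

A = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)

# full_matrices=False returns U with shape (5, 3) instead of (5, 5).
U, sigma, VT = np.linalg.svd(A, full_matrices=False)
recon = U @ np.diag(sigma) @ VT

print(U.shape, sigma.shape, VT.shape)
print(np.round(sigma, 4))
print(np.allclose(recon, A))
```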
15. Kyunghoon Kim
Singular Value Decomposition (SVD)
Harrington, Peter. Machine learning in action. Manning Publications Co., 2012.
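The figure for this slide is missing from the transcript. Since the next slide keeps only the two largest singular values, the standard rank-k truncation is stated here for reference. In it, u_i and v_i denote the i-th columns of U and V, and k is assumed to be smaller than the rank of A.

```latex
A_k = U_k \Sigma_k V_k^{T} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T},
\qquad
\lVert A - A_k \rVert_2 = \sigma_{k+1}
```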
16. Kyunghoon Kim
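U, Sigma, VT = np.linalg.svd(A)
S = np.zeros((U.shape[1], VT.shape[0]))
# Keep only the two largest singular values; the third is set to zero
S[:2, :2] = np.diag(Sigma[:2])
# Rank-2 approximation of A
Recon = U * S * VT
print(np.round(Recon, 5))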
Example of SVD :: 2 singular values
[[ 0.5 0.5 0.]
[ 0.5 0.5 0.]
[ 1. 1. 1.]
[ 1. 1. 0.]
[ 0. 0. 1.]]
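The same truncation can be wrapped in a small helper that works for any k. The function name `rank_k_approximation` is hypothetical, and the sketch uses the reduced SVD with `ndarray` inputs instead of `np.matrix`. It also checks the standard property that the spectral-norm error equals the first discarded singular value.

```python
import numpy as np

def rank_k_approximation(a, k):
    """Return the rank-k matrix built from the k largest singular triplets of a."""
    u, sigma, vt = np.linalg.svd(a, full_matrices=False)
    return u[:, :k] @ np.diag(sigma[:k]) @ vt[:k, :]

A = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)

A2 = rank_k_approximation(A, 2)
print(np.round(A2, 5))

# The spectral-norm error equals the first discarded singular value.
singular_values = np.linalg.svd(A, compute_uv=False)
error = np.linalg.norm(A - A2, 2)
print(error, singular_values[2])
```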
20. Kyunghoon Kim
Case1.
Case2.
Example with Query
matrix([[ 5.00000000e-01, 5.00000000e-01, 5.27355937e-16],
[ 5.00000000e-01, 5.00000000e-01, -1.94289029e-16],
[ 1.00000000e+00, 1.00000000e+00, 1.00000000e+00],
[ 1.00000000e+00, 1.00000000e+00, 3.33066907e-16],
[ 5.55111512e-16, -2.49800181e-16, 1.00000000e+00]])
21. Kyunghoon Kim
Case1.
Example with Query
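query = np.matrix([[1, 0, 0, 1, 0]])
# Cosine similarity between the query and each document (column) of Recon
for i in range(Recon.shape[1]):
    q = query
    d = Recon[:, i]
    # .item() replaces np.asscalar, which current NumPy no longer provides
    dotproduct = np.dot(q, d).item()
    normq = np.linalg.norm(q)
    normd = np.linalg.norm(d)
    print(dotproduct / (normq * normd))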
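The loop above computes one cosine similarity per document. A vectorized sketch of the same computation follows. The function name `cosine_similarities` is hypothetical, the rank-2 matrix is entered by hand from the rounded values shown earlier, and the code assumes no document column is all zeros.

```python
import numpy as np

def cosine_similarities(query, docs):
    """Cosine similarity between a 1-D query and every column of docs."""
    query = np.asarray(query, dtype=float)
    docs = np.asarray(docs, dtype=float)
    dots = query @ docs
    # Assumption: neither the query nor any column has zero norm.
    norms = np.linalg.norm(query) * np.linalg.norm(docs, axis=0)
    return dots / norms

# Rank-2 approximation of A, using the rounded values from the slides.
recon = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.5, 0.0],
                  [1.0, 1.0, 1.0],
                  [1.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
query = np.array([1, 0, 0, 1, 0])

similarities = cosine_similarities(query, recon)
print(np.round(similarities, 4))
```

With these values the first two documents score identically (1.5 / sqrt(5), about 0.6708) and the third scores 0.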
23. Kyunghoon Kim
What’s the feature of LSI?
Approximation of A = matrix([[ 0.5, 0.5, 0. ],
                             [ 0.5, 0.5, 0. ],
                             [ 1. , 1. , 1. ],
                             [ 1. , 1. , 0. ],
                             [ 0. , 0. , 1. ]])
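One reading of the matrix above is that the first two terms, which each appeared in only one document in A, now carry equal weight in documents 1 and 2. The sketch below illustrates this with a hypothetical query containing only the first term, scored by cosine similarity against the original matrix and against the approximation. The query and the helper name `score` are assumptions made for illustration.

```python
import numpy as np

A = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)

# Rank-2 approximation of A, using the rounded values shown above.
A2 = np.array([[0.5, 0.5, 0.0],
               [0.5, 0.5, 0.0],
               [1.0, 1.0, 1.0],
               [1.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])

# Hypothetical query that contains only the first term.
query = np.array([1, 0, 0, 0, 0], dtype=float)

def score(q, m):
    """Cosine similarity of q with each column of m."""
    return (q @ m) / (np.linalg.norm(q) * np.linalg.norm(m, axis=0))

original_scores = score(query, A)
lsi_scores = score(query, A2)
print(np.round(original_scores, 4))  # only document 1 matches
print(np.round(lsi_scores, 4))       # documents 1 and 2 match equally
```

In the original matrix the query matches only document 1. After truncation it also matches document 2, which never contained the term.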
29. Kyunghoon Kim
• Probabilistic Latent Semantic Indexing
• Latent Dirichlet Allocation
What’s Next?
30. Kyunghoon Kim
• Harrington, Peter. Machine learning in action. Manning Publications Co., 2012.
• Simovici, Dan A. Linear algebra tools for data mining. World Scientific, 2012.
• Berry, Michael W., Susan T. Dumais, and Gavin W. O'Brien. "Using linear algebra for intelligent information retrieval." SIAM review 37.4 (1995): 573-595.
References