This deck discusses using support vector machines (SVMs) to rank web search results: an SVM learns a weight vector that maximizes the relevance score of correct results on training data while minimizing a multivariate loss function over item pairs. A ranking SVM consistently improved the click rank on Shopping.com by 12%, indicating that SVMs are effective for learning document relevance in web search ranking. Large-scale linear SVMs for ranking can be solved with conjugate gradient or a cutting plane algorithm.
2. Relevance as a Linear Regression

Model: r = Xw + e
  X: matrix of (tf-idf) bag-of-words document vectors
  r: relevance scores (i.e. +1/-1)
  w: weight vector
  e: error term

Form X from the training data (i.e. a group of queries*), then solve via the
Moore-Penrose pseudoinverse,

  w = (X†X)^(-1) X†r

or solve it as a numerical minimization,

  min_w ||Xw - r||₂²    (||·||₂: 2-norm of w)

using iterative methods (i.e. SOR, CG, etc.).

*Actually we will model and predict pairwise relations and not exact rank... stay tuned.
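As a concrete illustration of the least-squares step, here is a minimal NumPy sketch; the tf-idf matrix and relevance labels are made-up toy data, and np.linalg.lstsq (an SVD-based pseudoinverse solve) stands in for whichever solver is used in practice:

```python
# A minimal sketch of the least-squares relevance model, on toy data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 20))               # 100 documents x 20 tf-idf features
r = np.sign(rng.standard_normal(100))   # relevance scores (+1 / -1)

# Solve min_w ||X w - r||_2^2 via the Moore-Penrose pseudoinverse.
# lstsq uses the SVD internally, more stable than forming (X†X)^(-1).
w, *_ = np.linalg.lstsq(X, r, rcond=None)

scores = X @ w                          # predicted relevance scores
```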
3. Relevance as a Linear Regression: Tikhonov Regularization

  w = (X†X)^(-1) X†r

Problem: the inverse may not exist (numerical instabilities, poles).
Solution: add a constant a to the diagonal of X†X:

  w = (X†X + aI)^(-1) X†r

  a: a single, adjustable smoothing parameter

Equivalent minimization problem:

  min_w ||Xw - r||₂² + a ||w||₂²

More generally: form (something like) X†X + G†G + aI, which is a self-adjoint,
bounded operator =>

  min_w ||Xw - r||₂² + a ||Gw||₂²    (i.e. G chosen to avoid over-fitting)
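A minimal sketch of the regularized solve, again on toy data; the value a = 0.1 is an arbitrary choice of the smoothing parameter:

```python
# Tikhonov (ridge) regularization: add a*I to the diagonal of X†X.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 20))
r = np.sign(rng.standard_normal(100))
a = 0.1                                  # single, adjustable smoothing parameter

d = X.shape[1]
# w = (X†X + aI)^(-1) X†r  -- solve() is preferred over an explicit inverse
w = np.linalg.solve(X.T @ X + a * np.eye(d), X.T @ r)
```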
4. The Representer Theorem Revisited: Kernels and Green's Functions

Problem: estimate a function f(x) from training data (xi, yi).
Solution: solve a general minimization problem, given a linear regularization
operator G: H -> L₂(X):

  min_f Σi Loss[f(xi), yi] + a ||Gf||²

Equivalent to:

  min_α Σi Loss[f(xi), yi] + a α†Kα,    Kij = R(xi, xj),    R := kernel

  f(x) = Σi αi R(x, xi)

where K is an integral operator, (Kf)(y) = ∫ R(x,y) f(x) dx, so R is the
Green's function for G†G, or G = (K^(1/2))†.
In Dirac notation: R(x,y) = <y|(G†G)^(-1)|x>.

More generally, f(x) = Σi αi R(x, xi) + Σu βu ψu(x), where the ψu span the
null space of G.

[Machine Learning Methods for Estimating Operator Equations (Steinke & Schölkopf 2006)]
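To make the representer-theorem recipe concrete, here is a minimal kernel ridge regression sketch with squared loss; the RBF kernel and the synthetic data are assumptions, with the RBF standing in for a Green's-function kernel R:

```python
# Representer theorem with squared loss: f(x) = sum_i alpha_i R(x, x_i),
# where (K + aI) alpha = y.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # R(x, y) = exp(-gamma * ||x - y||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
Xtr = rng.uniform(-1, 1, (50, 2))
ytr = np.sin(Xtr.sum(axis=1)) + 0.1 * rng.standard_normal(50)

a = 0.1                                  # regularization weight
K = rbf_kernel(Xtr, Xtr)                 # K_ij = R(x_i, x_j)
alpha = np.linalg.solve(K + a * np.eye(len(Xtr)), ytr)

Xte = rng.uniform(-1, 1, (5, 2))
f_te = rbf_kernel(Xte, Xtr) @ alpha      # f(x) = sum_i alpha_i R(x, x_i)
```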
5. Personalized Relevance Algorithms: eSelf Personality Subspace

[Diagram: a user reads pages p (i.e. a music site); learned personality traits q
are updated (Likes cars: 0.4, Sports cars: 0.0 => 0.3, Rock-n-roll, Hard rock);
a matching ad (a used sports car ad) is presented to the user.]

Compute personality traits during the user's visit to the web site.
q values = stored, learned "personality traits".
Provide relevance rankings (for pages or ads) which include the personality traits.
6. Personalized Relevance Algorithms: eSelf Personality Subspace

Model: L [p, q] = [h, u], where L is a square matrix
  p: output nodes (observables): web pages, classified ads, ...
  q: hidden nodes (not observed): individualized personality traits
  h: history (observed outputs)
  u: user segmentation
7. Personalized Search: Effective Regression Problem

On each time step (t): [p, q](t) = (Leff[q(t-1)])^(-1) · [h, u](t)

Block form:

  [ PLP  PLQ ] [p]   [h]
  [ QLP  QLQ ] [q] = [u]

i.e.  PLP p + PLQ q = h
      QLP p + QLQ q = 0    (taking u = 0)

Formal solution: eliminate the hidden block q =>

  Leff = PLP - PLQ (QLQ)^(-1) QLP
  Leff p = h
  p = (Leff[q,u])^(-1) h

Adapts on each visit, finding relevant pages p(t) based on the links L and the
learned personality traits q(t-1).
Regularization of PLP is achieved with a "Green's Function / Resolvent Operator",
i.e. G†G ≈ PLQ (QLQ)^(-1) QLP.
Equivalent to a Gaussian Process on a Graph, and/or Bayesian Linear Regression.
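A minimal numerical sketch of this block elimination, on a random symmetric positive definite L with assumed block sizes; it checks that the effective-operator solution matches solving the full system with u = 0:

```python
# Effective operator via Schur-complement elimination of the hidden block q.
import numpy as np

rng = np.random.default_rng(0)
n_p, n_q = 4, 3
L = rng.standard_normal((n_p + n_q, n_p + n_q))
L = L @ L.T + 7 * np.eye(n_p + n_q)      # symmetric positive definite

PLP = L[:n_p, :n_p]; PLQ = L[:n_p, n_p:]
QLP = L[n_p:, :n_p]; QLQ = L[n_p:, n_p:]
h = rng.standard_normal(n_p)

# Eliminate q from:  PLP p + PLQ q = h,  QLP p + QLQ q = 0
Leff = PLP - PLQ @ np.linalg.solve(QLQ, QLP)
p = np.linalg.solve(Leff, h)

# Check against solving the full system with u = 0
full = np.linalg.solve(L, np.concatenate([h, np.zeros(n_q)]))
assert np.allclose(p, full[:n_p])
```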
8. Related Dimensional Noise Reductions: Rank (k) Approximations of a Matrix

Latent Semantic Analysis (LSA) via the (truncated) Singular Value Decomposition (SVD):
  Diagonalize the density operator D = A†A
  Retain a subset of (k) eigenvalues/vectors

Equivalent relations for the SVD:
  Decomposition: A = UΣV†,  A†A = V(Σ†Σ)V†
  Optimal rank (k) approximation: X s.t. min ||D - X||₂²
  Block form of D over observed/hidden projections:
    [ PDP  PDQ ]
    [ QDP  QDQ ]

Can generalize to various noise models, i.e. VLSI*, PLSA**:

*Variable Latent Semantic Indexing (Yahoo! Research Labs)
http://www.cs.cornell.edu/people/adg/html/papers/vlsi.pdf
VLSI provides a rank (k) approximation tuned to a query distribution q:
  min E[ ||qᵀ(D - X)||₂² ]

**Probabilistic Latent Semantic Indexing (Recommind, Inc)
http://www.cs.brown.edu/~th/papers/Hofmann-UAI99.pdf
PLSA provides a rank (k) approximation over latent classes (z):
  min DKL[P - P(data)],   P = UΣV†,   P(d,w) = Σz P(d|z) P(z) P(w|z)
  (DKL = Kullback-Leibler divergence)
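A minimal sketch of the truncated-SVD rank (k) approximation on a toy term-document matrix A; by the Eckart-Young theorem the truncation is the optimal rank (k) approximation:

```python
# LSA-style rank-k truncation with NumPy's SVD.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((200, 50))                 # 200 terms x 50 documents

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 10
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]  # optimal rank-k approximation

# Eckart-Young: no rank-k matrix is closer to A in Frobenius/spectral norm
err = np.linalg.norm(A - A_k)
```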
10. KA for Comm Services

• Based on the Empirical Bayesian score and a Suggestion mapping table, a
  decision is made to suggest one or more possible Comm services.
• Based on Business Intelligence (BI) data mining and/or Pattern Recognition
  algorithms (i.e. supervised or unsupervised learning), we compute statistical
  scores indicating who are the most likely people to Call, or to send an SMS,
  MMS, or E-Mail.
[Diagram: events p and personal context q (i.e. Sunday mornings) map to a
contextual comm service. Learned trait: on Sunday morning, most likely to call
Mom. Suggestions for the user (Call [who], SMS [who], MMS [who]):
Mom (5), Bob (3), Phone company (1).]
12. Bayesian Score Estimation

To estimate p(call|POD):

frequency:
  p(call|POD) = (# of times the user called someone at that POD) /
                (# of opportunities at that POD)

Bayesian:
  p(call|POD) = p(POD|call) p(call) / Σq p(POD|q) p(q)
  where q = call, sms, mms, or email
13. i.e. Bayesian Choice Estimator

• We seek to know the probability of a "call" (choice) at a given POD.
• We "borrow information" from the other PODs, assuming this is less biased,
  to improve our statistical estimate.

[Diagram: a grid of 5 days x 3 PODs, with 3 choice types observed.]

frequency estimator:
  f(call | POD 1) = 2/5

Bayesian choice estimator:
  p(call | POD 1) = (2/5)(3/15) / [ (2/5)(3/15) + (2/5)(3/15) + (1/5)(11/15) ]
                  = 6/23 ≈ 1/4

Note: the Bayesian estimate is significantly lower because we now expect that
we might see the other choices at POD 1.
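A minimal sketch of the estimator as written on the slide, reproducing the 6/23 example; the per-choice frequencies and pooled priors are the slide's values, plugged in directly:

```python
# Bayesian choice estimator: reweight within-POD frequencies by pooled priors.
import numpy as np

def bayes_choice(f_pod, prior):
    # p(choice | pod) ∝ f(choice | pod) * p(choice), renormalized,
    # "borrowing information" from the other PODs via the priors p(q).
    w = f_pod * prior
    return w / w.sum()

f_pod1 = np.array([2/5, 2/5, 1/5])       # per-choice frequencies at POD 1
prior  = np.array([3/15, 3/15, 11/15])   # pooled choice priors (slide values)

print(f_pod1[0])                         # frequency estimator: 2/5 = 0.40
print(bayes_choice(f_pod1, prior)[0])    # Bayesian estimator: 6/23 ~ 0.26
```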
14. Incorporating Feedback

• It is not enough to simply recognize call patterns in the Event Facts; it is
  also necessary to incorporate feedback into our suggestion scores.
• p( c | user, pod, loc, facts, feedback ) = ?

[Diagram: Event Facts -> Suggestions, with feedback graded: random, irrelevant,
poor, good.]

A: Simply factorize, evaluating the probabilities independently, perhaps using
different Bayesian models:

  p( c | user, pod, facts, feedback ) =
      p( c | user, pod, facts ) · p( c | user, pod, feedback )
15. Personalized Relevance: Empirical Bayesian Models

Closed form models: correct a sample estimate (mean m, variance σ) with a
weighted average of the sample + the complete dataset:

  m = B m_sample + (1 - B) m_segment

  B: shrinkage factor
  m_sample: individual sample mean;  m_segment: user-segment mean

[Chart: per-service likelihoods for a user, i.e. play game, send msg, play song.]

Can rank order mobile services based on the estimated likelihoods (m, σ).
16. Personalized Relevance: Empirical Bayesian Models

What is Empirical Bayes modeling?
  Specify Likelihood L(y|θ) and Prior π(θ) distributions.
  Estimate the posterior:

    π(θ|y) = L(y|θ) π(θ) / ∫ L(y|θ) π(θ) dθ    (denominator: the marginal)

Combines Bayesianism and frequentism:
  Approximates the marginal (or posterior) using a point estimate (MLE),
  Monte Carlo, etc.
  Estimates the marginal using empirical data.
  Uses the empirical data to infer the prior, then plugs it into the likelihood
  to make predictions.

Note: a special case of Effective Operator Regression:
  P space ~ Q space ;  PLQ = I ;  u ≠ 0
  The Q-space defines the prior information.
17. Empirical Bayesian Methods: Poisson Gamma Model

Likelihood: L(y|λ) = Poisson distribution = (λ^y e^(-λ)) / y!
Conjugate Prior: π(λ; a,b) = Gamma distribution = (λ^(a-1) e^(-λ/b)) / (Γ(a) b^a) ;  λ > 0

posterior(λ) ∝ L(y|λ) π(λ) = ((λ^y e^(-λ)) / y!) ((λ^(a-1) e^(-λ/b)) / (Γ(a) b^a))
             ∝ λ^(y+a-1) e^(-λ(1 + 1/b))

which is also a Gamma distribution (a', b'):
  a' = y + a ;  b' = (1 + 1/b)^(-1)

Take the MLE estimate of the marginal = mean (m) of the posterior (a'b').
Obtain a, b from the mean (m = ab) and variance (ab²) of the complete data.

The final point estimate E(y) = a'b' for a sample is a weighted average of the
sample mean y = m_y and the prior mean m:
  E(y) = (m_y + a) (1 + 1/b)^(-1)
  E(y) = (b/(1+b)) m_y + (1/(1+b)) m
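A minimal sketch of the resulting shrinkage estimate; the segment counts are made up, and the prior is fit by moments from the complete data as described above:

```python
# Poisson-Gamma empirical Bayes: E(y) shrinks the sample count toward the
# prior mean of the user segment.
import numpy as np

# Complete-data (user-segment) counts: fit the Gamma prior by moments,
# mean = a*b, variance = a*b^2  =>  b = var/mean, a = mean/b.
segment = np.array([3, 5, 4, 6, 2, 5, 4])   # toy counts
m, v = segment.mean(), segment.var()
b = v / m
a = m / b

y = 1                                       # one user's observed count
a_post = y + a                              # a' = y + a
b_post = 1.0 / (1.0 + 1.0 / b)              # b' = (1 + 1/b)^(-1)
E_y = a_post * b_post                       # posterior mean

# The same thing as an explicit shrinkage between the sample y and prior mean m:
B = b / (1.0 + b)
assert np.isclose(E_y, B * y + (1 - B) * m)
```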
18. Linear Personality Matrix

[Diagram: events -> suggestions -> actions]

Linear (or non-linear) matrix transformation: M s = a
  i.e. s1 = call, s2 = sms, s3 = mms, s4 = email

Over time we can estimate M_(a,s) = prob( a | s ): i.e. for a given time and
location, count how many times we suggested a call but the user chose an
e-mail instead. We can then solve for prob( s ) using a computational linear
solver: s = M^(-1) a.

Notice: the personality matrix may or may not mix suggestions across events,
and can include semantic information.

Obviously we would like M to be diagonal... or as close to diagonal as possible!
Can we devise an algorithm that will learn to give "optimal" suggestions?
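A minimal sketch of this correction; the count table (actions taken vs. suggestions shown) is made up, and column-normalizing it gives the estimate of M:

```python
# Personality-matrix correction M s = a: estimate M[a, s] = prob(action a |
# suggestion s) from counts, then recover the suggestion distribution.
import numpy as np

# rows = action taken (call, sms, mms, email), cols = suggestion shown
counts = np.array([[40.,  3.,  1.,  2.],
                   [ 5., 30.,  2.,  4.],
                   [ 1.,  2., 25.,  3.],
                   [ 4.,  5.,  2., 35.]])

M = counts / counts.sum(axis=0)          # column-normalize: prob(a | s)
a_obs = np.array([0.4, 0.3, 0.1, 0.2])   # observed action distribution

s = np.linalg.solve(M, a_obs)            # s = M^(-1) a
```

A near-diagonal M (suggestions mostly accepted) keeps this solve well-conditioned, which is another way of reading the "we would like M to be diagonal" remark above.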
19. Matrices for Pattern Recognition (Statistical Factor Analysis)

We can apply Computational Linear Algebra to remove noise and find patterns in
data. Statisticians call this Factor Analysis; engineers call it the Singular
Value Decomposition (SVD). It is implemented in Oracle Data Mining (ODM) as
Non-Negative Matrix Factorization.

1. Enumerate all choices (rows): Call on Mon @ pod 1, Call on Mon @ pod 2,
   Call on Mon @ pod 3, ..., SMS on Tue @ pod 1, ...
2. Count the # of times each choice is made each week (columns: week 1, 2, 3, 4, 5, ...).
3. Form the weekly choice density matrix A†A.
4. Weekly patterns are collapsed into the density matrix A†A; they can be
   detected using spectral analysis (i.e. the principal eigenvalues), which
   separates all the weekly patterns from pure noise.

Similar to the Latent (Multinomial) Dirichlet Algorithm (LDA), but much simpler
to implement. Suitable when the number (#) of choices is not too large, and
the patterns are weekly.
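A minimal sketch of this recipe on synthetic counts: one stable weekly habit plus Poisson noise; the leading eigenvalue of A†A stands out from the noise floor:

```python
# Weekly-pattern detection: spectrum of the choice density matrix A†A.
import numpy as np

rng = np.random.default_rng(0)
n_choices, n_weeks = 30, 12
pattern = rng.integers(0, 5, n_choices)          # one stable weekly habit
A = pattern[:, None] + rng.poisson(1.0, (n_choices, n_weeks))

evals = np.linalg.eigvalsh(A.T @ A)[::-1]        # descending spectrum
print(evals[:3])   # the principal eigenvalue dominates; the rest are noise
```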
21. Statistical Machine Learning: Support Vector Machines (SVM)

From Regression to Classification: Maximum Margin Solutions

Classification := find the line that separates the points with the maximum
margin (width 2/||w||₂).

  min ½||w||₂²  subject to the constraints

constraint specifications:
  all "above" the line: w·xi - b >= +1 - ξi
  all "below" the line: w·xi - b <= -1 + ξi

A simple minimization (regression) becomes a convex optimization
(classification), perhaps within some slack (i.e. min ½||w||₂² + C Σi ξi).
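A minimal sketch of the soft-margin objective, trained by plain subgradient descent on toy Gaussian blobs; the solver choice and hyperparameters are assumptions, not a production SVM solver:

```python
# Soft-margin linear SVM by subgradient descent on
#   1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i - b))
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

w, b, C, lr = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(500):
    margins = y * (X @ w - b)
    viol = margins < 1                    # points inside the margin (hinge active)
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

acc = (np.sign(X @ w - b) == y).mean()    # training accuracy
```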
22. SVM Light: Multivariate Rank Constraints

Multivariate classification:

  min ½||w||₂² + Cξ   s.t.
  for all y': wᵀΨ(x,y) - wᵀΨ(x,y') >= Δ(y,y') - ξ

Let Ψ(x,y') = Σi y'i xi be a linear function.

[Diagram: sgn(wᵀx) maps docs x = (x1, x2, ..., xn) with scores
(-0.1, +1.2, ..., -0.7) to relevance labels y = (-1, +1, ..., -1).]

Learn the weights w s.t. argmax_y' wᵀΨ(x,y') is correct for the training set
(within a single slack constraint ξ).

Δ(y,y') is a multivariate loss function (i.e. 1 - Average Precision(y,y')).
Ψ(x,y') is a linear discriminant function (i.e. a sum of ordered pairs,
Σi Σj yij (xi - xj)).
23. SvmLight Ranking SVMs

SVMperf: ROC Area, F1 Score, Precision/Recall
SVMmap: Mean Average Precision (warning: buggy!)
SVMrank: Ordinal Regression
  Standard classification on pairwise differences:

  min ½||w||₂² + C Σ ξ(i,j,k)   s.t.
  for all queries qk (later, this may not be query specific in SVMstruct)
  and all doc pairs (di, dj): wᵀΨ(qk,di) - wᵀΨ(qk,dj) >= 1 - ξ(i,j,k)

ΔROCArea = 1 - (# swapped pairs): enforces a directed ordering.

Example: two orderings of docs 1-8 with relevance labels (1 0 0 0 0 1 1 0) are
scored differently by different metrics:

  Ordering           MAP    ROC Area
  1 2 3 4 5 6 7 8    0.56   0.47
  8 7 6 5 4 3 2 1    0.51   0.53

A Support Vector Method for Optimizing Average Precision (Joachims et al. 2007)
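A minimal sketch of the pairwise reduction behind SVMrank: form differences (xi - xj) for correctly ordered pairs and enforce the margin with a hinge penalty; the synthetic single-query data and the subgradient solver are assumptions:

```python
# Ordinal regression as classification on pairwise document differences.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))            # 20 docs x 5 features, one query
w_true = rng.standard_normal(5)
rel = X @ w_true                            # "true" relevance scores

# Build pairwise differences: doc i should outrank doc j
pairs = [X[i] - X[j]
         for i in range(len(X)) for j in range(len(X)) if rel[i] > rel[j]]
P = np.array(pairs)

# One-sided hinge: enforce w.(x_i - x_j) >= 1 - slack on every correct pair
w, C, lr = np.zeros(5), 1.0, 0.01
for _ in range(300):
    viol = (P @ w) < 1
    w -= lr * (w - C * P[viol].sum(axis=0))

frac_ordered = ((P @ w) > 0).mean()         # fraction of correctly ordered pairs
```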
24. Large Scale, Linear SVMs

• Solving the Primal
  – Conjugate Gradient
  – Joachims: cutting plane algorithm
  – Nyogi
• Handling Large Numbers of Constraints
  – Cutting Plane Algorithm
• Open Source Implementations:
  – LibSVM
  – SVMLight
25. Search Engine Relevance: Listings on Shopping.com

A ranking SVM consistently improves the Shopping.com <click rank> by 12%.
26. Various Sparse Matrix Problems

Google Page Rank algorithm
  M a = a : rank a series of web pages by simulating user browsing patterns (a)
  based on a probabilistic model (M) of the page links.

Pattern Recognition, Inference
  L p = h : estimate unknown probabilities (p) based on historical observations
  (h) and a probability model (L) of the links between hidden nodes.

Quantum Chemistry
  H Ψ = E Ψ : compute the color of dyes and pigments given empirical information
  on related molecules, and/or by solving massive eigenvalue problems.
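A minimal power-iteration sketch for M a = a with the usual damped random-surfer model; the 4-page link matrix and the damping factor 0.85 are toy assumptions:

```python
# PageRank by power iteration: the fixed point a of the damped matrix G.
import numpy as np

links = np.array([[0, 1, 1, 0],     # entry (i, j) = 1 if page j links to page i
                  [1, 0, 0, 1],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
M = links / links.sum(axis=0)       # column-stochastic transition matrix

d, n = 0.85, 4
G = d * M + (1 - d) / n             # damped "random surfer" model
a = np.full(n, 1.0 / n)
for _ in range(100):
    a = G @ a                       # power iteration converges to G a = a
a /= a.sum()                        # the PageRank vector
```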
27. Quantum Chemistry: the electronic structure eigenproblem

Solve a massive eigenvalue problem (dimension 10⁹-10¹²):

  H Ψ(σ, π, ...) = E Ψ(σ, π, ...)

  H: energy matrix
  Ψ: quantum state eigenvector
  E: energy eigenvalue
  σ, π, ...: electrons

Methods can have general applicability:
  the Davidson method for dominant eigenvalues/eigenvectors.

Motivation for Personalization Technology: it grew out of understanding the
conceptual foundations of semi-empirical models (noiseless dimensional reduction).
28. Relations between Quantum Mechanics and Probabilistic Language Models

• Quantum states resemble the states (strings, words, phrases) in probabilistic
  language models (HMMs, SCFGs), except that Ψ is a sum* of strings of electrons:

    Ψ(σ, π) = 0.1 |σ1 σ2 π1 π2| + 0.2 |σ2 σ3 π1 π2| + ...

• The energy matrix H is known exactly, but it is large. Models of H can be
  inferred from empirical data to simplify the computations.
• Energies ≈ log[probabilities], un-normalized.

*Not just a single string!
29. Dimensional Reduction in Quantum Chemistry:
where do semi-empirical Hamiltonians come from?

Ab initio (from first principles):
  Solve the entire H Ψ(σ,π) = E Ψ(σ,π) ... approximately.

OR

Semi-empirical:
  Assume the (σ, π) electrons are statistically independent:
    Ψ(σ,π) = p(π) q(σ)
  Treat the π-electrons explicitly, ignore σ (hidden):
    PHP p(π) = E p(π) : a much smaller problem
  Parameterize the PHP matrix => Heff with empirical data using a small set of
  molecules, then apply it to others (dyes, pigments).
30. Effective Hamiltonians: Semi-Empirical Pi-Electron Methods

  Heff[σ] p(π) = E p(π)

Block form:

  [ PHP  PHQ ] [p]       [p]
  [ QHP  QHQ ] [q]  =  E [q]

i.e.  PHP p + PHQ q = E p
      QHP p + QHQ q = E q    (q: implicit / hidden)

Formal solution: eliminate q =>

  Heff[E] = PHP + PHQ (E - QHQ)^(-1) QHP

The solution is formally exact => Dimensional Reduction / "Renormalization".
The final Heff can be solved iteratively (as with the eSelf Leff), or
perturbatively in various forms.
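A minimal numerical check that the partitioning is formally exact: for an eigenvalue E of the full (random, symmetric) H, Heff(E) on the small P block reproduces E; the block sizes are arbitrary:

```python
# Effective Hamiltonian by partitioning: Heff(E) = PHP + PHQ (E - QHQ)^-1 QHP.
import numpy as np

rng = np.random.default_rng(0)
n_p, n_q = 3, 5
H = rng.standard_normal((n_p + n_q, n_p + n_q))
H = (H + H.T) / 2                            # symmetric toy "Hamiltonian"

PHP = H[:n_p, :n_p]; PHQ = H[:n_p, n_p:]
QHP = H[n_p:, :n_p]; QHQ = H[n_p:, n_p:]

E = np.linalg.eigvalsh(H)[0]                 # an exact eigenvalue of the full H
Heff = PHP + PHQ @ np.linalg.solve(E * np.eye(n_q) - QHQ, QHP)

# Heff(E) acting on the small P block has E among its eigenvalues
w = np.linalg.eigvalsh(Heff)
print(E, w[np.argmin(np.abs(w - E))])        # the two values agree
```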
31. Graphical Methods

Decompose Heff into effective interactions Vij between electrons
(expand (E - QHQ)^(-1) in an infinite series, and remove the E dependence):

  [Diagram: Vij = a sum of interaction diagrams + ...]

Represent the expansion diagrammatically: ~300 diagrams to evaluate.
Precompile using symbolic manipulation: a ~35 MB executable; 8-10 hours to
compile; run time: 3-4 hours/parameter.
32. Effective Hamiltonians: Numerical Calculations

Example values for the parameter VCC (in eV):

  VCC:  π-only: 16    effective: 11.5    empirical: 11-12

Compute the "empirical" parameters ab initio:
  Can test all the basic assumptions of semi-empirical theory
  "from first principles".
  Also provides highly accurate eigenvalue spectra.
  Augment commercial packages (i.e. Fujitsu MOPAC) to model the spectroscopy
  of photoactive proteins.