This deck discusses using support vector machines (SVMs) to rank web search results: an SVM learns a weight vector that maximizes the relevance score of correct results on training data while minimizing a multivariate loss function over item pairs. A ranking SVM consistently improved the click rank on Shopping.com by 12%, indicating that SVMs are effective for learning document relevance in web search ranking. Large-scale linear SVMs for ranking can be solved with conjugate gradient or a cutting plane algorithm.
2. Relevance as a Linear Regression

Model: r = Xw + e
  X: matrix of (tf-idf) bag-of-words document vectors
  r: relevance scores (i.e. +1/-1)
  w: weight vector
  e: error term

Form X from the training data (i.e. a group of queries*), then solve via the
Moore-Penrose pseudoinverse,

  w = (X†X)^(-1) X†r

or solve it as a numerical minimization,

  min_w ||Xw - r||₂²    (||·||₂: 2-norm of w)

using iterative methods (i.e. SOR, CG, etc.).

*Actually we will model and predict pairwise relations and not exact rank... stay tuned.
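As a concrete illustration of the least-squares step, here is a minimal NumPy sketch; the tf-idf matrix and relevance labels are made-up toy data, and np.linalg.lstsq (an SVD-based pseudoinverse solve) stands in for whichever solver is used in practice:

```python
# A minimal sketch of the least-squares relevance model, on toy data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 20))               # 100 documents x 20 tf-idf features
r = np.sign(rng.standard_normal(100))   # relevance scores (+1 / -1)

# Solve min_w ||X w - r||_2^2 via the Moore-Penrose pseudoinverse.
# lstsq uses the SVD internally, more stable than forming (X†X)^(-1).
w, *_ = np.linalg.lstsq(X, r, rcond=None)

scores = X @ w                          # predicted relevance scores
```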
3. Relevance as a Linear Regression: Tikhonov Regularization

  w = (X†X)^(-1) X†r

Problem: the inverse may not exist (numerical instabilities, poles).
Solution: add a constant a to the diagonal of X†X:

  w = (X†X + aI)^(-1) X†r

  a: a single, adjustable smoothing parameter

Equivalent minimization problem:

  min_w ||Xw - r||₂² + a ||w||₂²

More generally: form (something like) X†X + G†G + aI, which is a self-adjoint,
bounded operator =>

  min_w ||Xw - r||₂² + a ||Gw||₂²    (i.e. G chosen to avoid over-fitting)
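A minimal sketch of the regularized solve, again on toy data; the value a = 0.1 is an arbitrary choice of the smoothing parameter:

```python
# Tikhonov (ridge) regularization: add a*I to the diagonal of X†X.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 20))
r = np.sign(rng.standard_normal(100))
a = 0.1                                  # single, adjustable smoothing parameter

d = X.shape[1]
# w = (X†X + aI)^(-1) X†r  -- solve() is preferred over an explicit inverse
w = np.linalg.solve(X.T @ X + a * np.eye(d), X.T @ r)
```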
4. The Representer Theorem Revisited: Kernels and Green's Functions

Problem: estimate a function f(x) from training data (xi, yi).
Solution: solve a general minimization problem, given a linear regularization
operator G: H -> L₂(X):

  min_f Σi Loss[f(xi), yi] + a ||Gf||²

Equivalent to:

  min_α Σi Loss[f(xi), yi] + a α†Kα,    Kij = R(xi, xj),    R := kernel

  f(x) = Σi αi R(x, xi)

where K is an integral operator, (Kf)(y) = ∫ R(x,y) f(x) dx, so R is the
Green's function for G†G, or G = (K^(1/2))†.
In Dirac notation: R(x,y) = <y|(G†G)^(-1)|x>.

More generally, f(x) = Σi αi R(x, xi) + Σu βu ψu(x), where the ψu span the
null space of G.

[Machine Learning Methods for Estimating Operator Equations (Steinke & Schölkopf 2006)]
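To make the representer-theorem recipe concrete, here is a minimal kernel ridge regression sketch with squared loss; the RBF kernel and the synthetic data are assumptions, with the RBF standing in for a Green's-function kernel R:

```python
# Representer theorem with squared loss: f(x) = sum_i alpha_i R(x, x_i),
# where (K + aI) alpha = y.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # R(x, y) = exp(-gamma * ||x - y||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
Xtr = rng.uniform(-1, 1, (50, 2))
ytr = np.sin(Xtr.sum(axis=1)) + 0.1 * rng.standard_normal(50)

a = 0.1                                  # regularization weight
K = rbf_kernel(Xtr, Xtr)                 # K_ij = R(x_i, x_j)
alpha = np.linalg.solve(K + a * np.eye(len(Xtr)), ytr)

Xte = rng.uniform(-1, 1, (5, 2))
f_te = rbf_kernel(Xte, Xtr) @ alpha      # f(x) = sum_i alpha_i R(x, x_i)
```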
5. Personalized Relevance Algorithms: eSelf Personality Subspace

[Diagram: a user reads pages p (i.e. a music site); learned personality traits q
are updated (Likes cars: 0.4, Sports cars: 0.0 => 0.3, Rock-n-roll, Hard rock);
a matching ad (a used sports car ad) is presented to the user.]

Compute personality traits during the user's visit to the web site.
q values = stored, learned "personality traits".
Provide relevance rankings (for pages or ads) which include the personality traits.
6. Personalized Relevance Algorithms: eSelf Personality Subspace

Model: L [p, q] = [h, u], where L is a square matrix
  p: output nodes (observables): web pages, classified ads, ...
  q: hidden nodes (not observed): individualized personality traits
  h: history (observed outputs)
  u: user segmentation
7. Personalized Search: Effective Regression Problem

On each time step (t): [p, q](t) = (Leff[q(t-1)])^(-1) · [h, u](t)

Block form:

  [ PLP  PLQ ] [p]   [h]
  [ QLP  QLQ ] [q] = [u]

i.e.  PLP p + PLQ q = h
      QLP p + QLQ q = 0    (taking u = 0)

Formal solution: eliminate the hidden block q =>

  Leff = PLP - PLQ (QLQ)^(-1) QLP
  Leff p = h
  p = (Leff[q,u])^(-1) h

Adapts on each visit, finding relevant pages p(t) based on the links L and the
learned personality traits q(t-1).
Regularization of PLP is achieved with a "Green's Function / Resolvent Operator",
i.e. G†G ≈ PLQ (QLQ)^(-1) QLP.
Equivalent to a Gaussian Process on a Graph, and/or Bayesian Linear Regression.
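A minimal numerical sketch of this block elimination, on a random symmetric positive definite L with assumed block sizes; it checks that the effective-operator solution matches solving the full system with u = 0:

```python
# Effective operator via Schur-complement elimination of the hidden block q.
import numpy as np

rng = np.random.default_rng(0)
n_p, n_q = 4, 3
L = rng.standard_normal((n_p + n_q, n_p + n_q))
L = L @ L.T + 7 * np.eye(n_p + n_q)      # symmetric positive definite

PLP = L[:n_p, :n_p]; PLQ = L[:n_p, n_p:]
QLP = L[n_p:, :n_p]; QLQ = L[n_p:, n_p:]
h = rng.standard_normal(n_p)

# Eliminate q from:  PLP p + PLQ q = h,  QLP p + QLQ q = 0
Leff = PLP - PLQ @ np.linalg.solve(QLQ, QLP)
p = np.linalg.solve(Leff, h)

# Check against solving the full system with u = 0
full = np.linalg.solve(L, np.concatenate([h, np.zeros(n_q)]))
assert np.allclose(p, full[:n_p])
```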
8. Related Dimensional Noise Reductions: Rank (k) Approximations of a Matrix

Latent Semantic Analysis (LSA) via the (truncated) Singular Value Decomposition (SVD):
  Diagonalize the density operator D = A†A
  Retain a subset of (k) eigenvalues/vectors

Equivalent relations for the SVD:
  Decomposition: A = UΣV†,  A†A = V(Σ†Σ)V†
  Optimal rank (k) approximation: X s.t. min ||D - X||₂²
  Block form of D over observed/hidden projections:
    [ PDP  PDQ ]
    [ QDP  QDQ ]

Can generalize to various noise models, i.e. VLSI*, PLSA**:

*Variable Latent Semantic Indexing (Yahoo! Research Labs)
http://www.cs.cornell.edu/people/adg/html/papers/vlsi.pdf
VLSI provides a rank (k) approximation tuned to a query distribution q:
  min E[ ||qᵀ(D - X)||₂² ]

**Probabilistic Latent Semantic Indexing (Recommind, Inc)
http://www.cs.brown.edu/~th/papers/Hofmann-UAI99.pdf
PLSA provides a rank (k) approximation over latent classes (z):
  min DKL[P - P(data)],   P = UΣV†,   P(d,w) = Σz P(d|z) P(z) P(w|z)
  (DKL = Kullback-Leibler divergence)
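A minimal sketch of the truncated-SVD rank (k) approximation on a toy term-document matrix A; by the Eckart-Young theorem the truncation is the optimal rank (k) approximation:

```python
# LSA-style rank-k truncation with NumPy's SVD.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((200, 50))                 # 200 terms x 50 documents

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 10
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]  # optimal rank-k approximation

# Eckart-Young: no rank-k matrix is closer to A in Frobenius/spectral norm
err = np.linalg.norm(A - A_k)
```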
10. KA for Comm Services

• Based on the Empirical Bayesian score and a Suggestion mapping table, a
  decision is made to suggest one or more possible Comm services.
• Based on Business Intelligence (BI) data mining and/or Pattern Recognition
  algorithms (i.e. supervised or unsupervised learning), we compute statistical
  scores indicating who are the most likely people to Call, or to send an SMS,
  MMS, or E-Mail.
[Diagram: events p and personal context q (i.e. Sunday mornings) map to a
contextual comm service. Learned trait: on Sunday morning, most likely to call
Mom. Suggestions for the user (Call [who], SMS [who], MMS [who]):
Mom (5), Bob (3), Phone company (1).]
12. Bayesian Score Estimation

To estimate p(call|POD):

frequency:
  p(call|POD) = (# of times the user called someone at that POD) /
                (# of opportunities at that POD)

Bayesian:
  p(call|POD) = p(POD|call) p(call) / Σq p(POD|q) p(q)
  where q = call, sms, mms, or email
13. i.e. Bayesian Choice Estimator

• We seek to know the probability of a "call" (choice) at a given POD.
• We "borrow information" from the other PODs, assuming this is less biased,
  to improve our statistical estimate.

[Diagram: a grid of 5 days x 3 PODs, with 3 choice types observed.]

frequency estimator:
  f(call | POD 1) = 2/5

Bayesian choice estimator:
  p(call | POD 1) = (2/5)(3/15) / [ (2/5)(3/15) + (2/5)(3/15) + (1/5)(11/15) ]
                  = 6/23 ≈ 1/4

Note: the Bayesian estimate is significantly lower because we now expect that
we might see the other choices at POD 1.
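A minimal sketch of the estimator as written on the slide, reproducing the 6/23 example; the per-choice frequencies and pooled priors are the slide's values, plugged in directly:

```python
# Bayesian choice estimator: reweight within-POD frequencies by pooled priors.
import numpy as np

def bayes_choice(f_pod, prior):
    # p(choice | pod) ∝ f(choice | pod) * p(choice), renormalized,
    # "borrowing information" from the other PODs via the priors p(q).
    w = f_pod * prior
    return w / w.sum()

f_pod1 = np.array([2/5, 2/5, 1/5])       # per-choice frequencies at POD 1
prior  = np.array([3/15, 3/15, 11/15])   # pooled choice priors (slide values)

print(f_pod1[0])                         # frequency estimator: 2/5 = 0.40
print(bayes_choice(f_pod1, prior)[0])    # Bayesian estimator: 6/23 ~ 0.26
```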
14. Incorporating Feedback

• It is not enough to simply recognize call patterns in the Event Facts; it is
  also necessary to incorporate feedback into our suggestion scores.
• p( c | user, pod, loc, facts, feedback ) = ?

[Diagram: Event Facts -> Suggestions, with feedback graded: random, irrelevant,
poor, good.]

A: Simply factorize, evaluating the probabilities independently, perhaps using
different Bayesian models:

  p( c | user, pod, facts, feedback ) =
      p( c | user, pod, facts ) · p( c | user, pod, feedback )
15. Personalized Relevance: Empirical Bayesian Models

Closed form models: correct a sample estimate (mean m, variance σ) with a
weighted average of the sample + the complete dataset:

  m = B m_sample + (1 - B) m_segment

  B: shrinkage factor
  m_sample: individual sample mean;  m_segment: user-segment mean

[Chart: per-service likelihoods for a user, i.e. play game, send msg, play song.]

Can rank order mobile services based on the estimated likelihoods (m, σ).
16. Personalized Relevance: Empirical Bayesian Models

What is Empirical Bayes modeling?
  Specify Likelihood L(y|θ) and Prior π(θ) distributions.
  Estimate the posterior:

    π(θ|y) = L(y|θ) π(θ) / ∫ L(y|θ) π(θ) dθ    (denominator: the marginal)

Combines Bayesianism and frequentism:
  Approximates the marginal (or posterior) using a point estimate (MLE),
  Monte Carlo, etc.
  Estimates the marginal using empirical data.
  Uses the empirical data to infer the prior, then plugs it into the likelihood
  to make predictions.

Note: a special case of Effective Operator Regression:
  P space ~ Q space ;  PLQ = I ;  u ≠ 0
  The Q-space defines the prior information.
17. Empirical Bayesian Methods: Poisson Gamma Model

Likelihood: L(y|λ) = Poisson distribution = (λ^y e^(-λ)) / y!
Conjugate Prior: π(λ; a,b) = Gamma distribution = (λ^(a-1) e^(-λ/b)) / (Γ(a) b^a) ;  λ > 0

posterior(λ) ∝ L(y|λ) π(λ) = ((λ^y e^(-λ)) / y!) ((λ^(a-1) e^(-λ/b)) / (Γ(a) b^a))
             ∝ λ^(y+a-1) e^(-λ(1 + 1/b))

which is also a Gamma distribution (a', b'):
  a' = y + a ;  b' = (1 + 1/b)^(-1)

Take the MLE estimate of the marginal = mean (m) of the posterior (a'b').
Obtain a, b from the mean (m = ab) and variance (ab²) of the complete data.

The final point estimate E(y) = a'b' for a sample is a weighted average of the
sample mean y = m_y and the prior mean m:
  E(y) = (m_y + a) (1 + 1/b)^(-1)
  E(y) = (b/(1+b)) m_y + (1/(1+b)) m
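A minimal sketch of the resulting shrinkage estimate; the segment counts are made up, and the prior is fit by moments from the complete data as described above:

```python
# Poisson-Gamma empirical Bayes: E(y) shrinks the sample count toward the
# prior mean of the user segment.
import numpy as np

# Complete-data (user-segment) counts: fit the Gamma prior by moments,
# mean = a*b, variance = a*b^2  =>  b = var/mean, a = mean/b.
segment = np.array([3, 5, 4, 6, 2, 5, 4])   # toy counts
m, v = segment.mean(), segment.var()
b = v / m
a = m / b

y = 1                                       # one user's observed count
a_post = y + a                              # a' = y + a
b_post = 1.0 / (1.0 + 1.0 / b)              # b' = (1 + 1/b)^(-1)
E_y = a_post * b_post                       # posterior mean

# The same thing as an explicit shrinkage between the sample y and prior mean m:
B = b / (1.0 + b)
assert np.isclose(E_y, B * y + (1 - B) * m)
```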
18. Linear Personality Matrix

[Diagram: events -> suggestions -> actions]

Linear (or non-linear) matrix transformation: M s = a
  i.e. s1 = call, s2 = sms, s3 = mms, s4 = email

Over time we can estimate M_(a,s) = prob( a | s ): i.e. for a given time and
location, count how many times we suggested a call but the user chose an
e-mail instead. We can then solve for prob( s ) using a computational linear
solver: s = M^(-1) a.

Notice: the personality matrix may or may not mix suggestions across events,
and can include semantic information.

Obviously we would like M to be diagonal... or as close to diagonal as possible!
Can we devise an algorithm that will learn to give "optimal" suggestions?
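A minimal sketch of this correction; the count table (actions taken vs. suggestions shown) is made up, and column-normalizing it gives the estimate of M:

```python
# Personality-matrix correction M s = a: estimate M[a, s] = prob(action a |
# suggestion s) from counts, then recover the suggestion distribution.
import numpy as np

# rows = action taken (call, sms, mms, email), cols = suggestion shown
counts = np.array([[40.,  3.,  1.,  2.],
                   [ 5., 30.,  2.,  4.],
                   [ 1.,  2., 25.,  3.],
                   [ 4.,  5.,  2., 35.]])

M = counts / counts.sum(axis=0)          # column-normalize: prob(a | s)
a_obs = np.array([0.4, 0.3, 0.1, 0.2])   # observed action distribution

s = np.linalg.solve(M, a_obs)            # s = M^(-1) a
```

A near-diagonal M (suggestions mostly accepted) keeps this solve well-conditioned, which is another way of reading the "we would like M to be diagonal" remark above.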
19. Matrices for Pattern Recognition (Statistical Factor Analysis)

We can apply Computational Linear Algebra to remove noise and find patterns in
data. Statisticians call this Factor Analysis; engineers call it the Singular
Value Decomposition (SVD). It is implemented in Oracle Data Mining (ODM) as
Non-Negative Matrix Factorization.

1. Enumerate all choices (rows): Call on Mon @ pod 1, Call on Mon @ pod 2,
   Call on Mon @ pod 3, ..., SMS on Tue @ pod 1, ...
2. Count the # of times each choice is made each week (columns: week 1, 2, 3, 4, 5, ...).
3. Form the weekly choice density matrix A†A.
4. Weekly patterns are collapsed into the density matrix A†A; they can be
   detected using spectral analysis (i.e. the principal eigenvalues), which
   separates all the weekly patterns from pure noise.

Similar to the Latent (Multinomial) Dirichlet Algorithm (LDA), but much simpler
to implement. Suitable when the number (#) of choices is not too large, and
the patterns are weekly.
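A minimal sketch of this recipe on synthetic counts: one stable weekly habit plus Poisson noise; the leading eigenvalue of A†A stands out from the noise floor:

```python
# Weekly-pattern detection: spectrum of the choice density matrix A†A.
import numpy as np

rng = np.random.default_rng(0)
n_choices, n_weeks = 30, 12
pattern = rng.integers(0, 5, n_choices)          # one stable weekly habit
A = pattern[:, None] + rng.poisson(1.0, (n_choices, n_weeks))

evals = np.linalg.eigvalsh(A.T @ A)[::-1]        # descending spectrum
print(evals[:3])   # the principal eigenvalue dominates; the rest are noise
```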
21. Statistical Machine Learning: Support Vector Machines (SVM)

From Regression to Classification: Maximum Margin Solutions

Classification := find the line that separates the points with the maximum
margin (width 2/||w||₂).

  min ½||w||₂²  subject to the constraints

constraint specifications:
  all "above" the line: w·xi - b >= +1 - ξi
  all "below" the line: w·xi - b <= -1 + ξi

A simple minimization (regression) becomes a convex optimization
(classification), perhaps within some slack (i.e. min ½||w||₂² + C Σi ξi).
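A minimal sketch of the soft-margin objective, trained by plain subgradient descent on toy Gaussian blobs; the solver choice and hyperparameters are assumptions, not a production SVM solver:

```python
# Soft-margin linear SVM by subgradient descent on
#   1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i - b))
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

w, b, C, lr = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(500):
    margins = y * (X @ w - b)
    viol = margins < 1                    # points inside the margin (hinge active)
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

acc = (np.sign(X @ w - b) == y).mean()    # training accuracy
```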
22. SVM Light: Multivariate Rank Constraints

Multivariate classification:

  min ½||w||₂² + Cξ   s.t.
  for all y': wᵀΨ(x,y) - wᵀΨ(x,y') >= Δ(y,y') - ξ

Let Ψ(x,y') = Σi y'i xi be a linear function.

[Diagram: sgn(wᵀx) maps docs x = (x1, x2, ..., xn) with scores
(-0.1, +1.2, ..., -0.7) to relevance labels y = (-1, +1, ..., -1).]

Learn the weights w s.t. argmax_y' wᵀΨ(x,y') is correct for the training set
(within a single slack constraint ξ).

Δ(y,y') is a multivariate loss function (i.e. 1 - Average Precision(y,y')).
Ψ(x,y') is a linear discriminant function (i.e. a sum of ordered pairs,
Σi Σj yij (xi - xj)).
23. SvmLight Ranking SVMs

SVMperf: ROC Area, F1 Score, Precision/Recall
SVMmap: Mean Average Precision (warning: buggy!)
SVMrank: Ordinal Regression
  Standard classification on pairwise differences:

  min ½||w||₂² + C Σ ξ(i,j,k)   s.t.
  for all queries qk (later, this may not be query specific in SVMstruct)
  and all doc pairs (di, dj): wᵀΨ(qk,di) - wᵀΨ(qk,dj) >= 1 - ξ(i,j,k)

ΔROCArea = 1 - (# swapped pairs): enforces a directed ordering.

Example: two orderings of docs 1-8 with relevance labels (1 0 0 0 0 1 1 0) are
scored differently by different metrics:

  Ordering           MAP    ROC Area
  1 2 3 4 5 6 7 8    0.56   0.47
  8 7 6 5 4 3 2 1    0.51   0.53

A Support Vector Method for Optimizing Average Precision (Joachims et al. 2007)
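A minimal sketch of the pairwise reduction behind SVMrank: form differences (xi - xj) for correctly ordered pairs and enforce the margin with a hinge penalty; the synthetic single-query data and the subgradient solver are assumptions:

```python
# Ordinal regression as classification on pairwise document differences.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))            # 20 docs x 5 features, one query
w_true = rng.standard_normal(5)
rel = X @ w_true                            # "true" relevance scores

# Build pairwise differences: doc i should outrank doc j
pairs = [X[i] - X[j]
         for i in range(len(X)) for j in range(len(X)) if rel[i] > rel[j]]
P = np.array(pairs)

# One-sided hinge: enforce w.(x_i - x_j) >= 1 - slack on every correct pair
w, C, lr = np.zeros(5), 1.0, 0.01
for _ in range(300):
    viol = (P @ w) < 1
    w -= lr * (w - C * P[viol].sum(axis=0))

frac_ordered = ((P @ w) > 0).mean()         # fraction of correctly ordered pairs
```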
24. Large Scale, Linear SVMs

• Solving the Primal
  – Conjugate Gradient
  – Joachims: cutting plane algorithm
  – Nyogi
• Handling Large Numbers of Constraints
  – Cutting Plane Algorithm
• Open Source Implementations:
  – LibSVM
  – SVMLight
25. Search Engine Relevance: Listings on Shopping.com

A ranking SVM consistently improves the Shopping.com <click rank> by 12%.
26. Various Sparse Matrix Problems

Google Page Rank algorithm
  M a = a : rank a series of web pages by simulating user browsing patterns (a)
  based on a probabilistic model (M) of the page links.

Pattern Recognition, Inference
  L p = h : estimate unknown probabilities (p) based on historical observations
  (h) and a probability model (L) of the links between hidden nodes.

Quantum Chemistry
  H Ψ = E Ψ : compute the color of dyes and pigments given empirical information
  on related molecules, and/or by solving massive eigenvalue problems.
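A minimal power-iteration sketch for M a = a with the usual damped random-surfer model; the 4-page link matrix and the damping factor 0.85 are toy assumptions:

```python
# PageRank by power iteration: the fixed point a of the damped matrix G.
import numpy as np

links = np.array([[0, 1, 1, 0],     # entry (i, j) = 1 if page j links to page i
                  [1, 0, 0, 1],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
M = links / links.sum(axis=0)       # column-stochastic transition matrix

d, n = 0.85, 4
G = d * M + (1 - d) / n             # damped "random surfer" model
a = np.full(n, 1.0 / n)
for _ in range(100):
    a = G @ a                       # power iteration converges to G a = a
a /= a.sum()                        # the PageRank vector
```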
27. Quantum Chemistry: the electronic structure eigenproblem

Solve a massive eigenvalue problem (dimension 10⁹-10¹²):

  H Ψ(σ, π, ...) = E Ψ(σ, π, ...)

  H: energy matrix
  Ψ: quantum state eigenvector
  E: energy eigenvalue
  σ, π, ...: electrons

Methods can have general applicability:
  the Davidson method for dominant eigenvalues/eigenvectors.

Motivation for Personalization Technology: it grew out of understanding the
conceptual foundations of semi-empirical models (noiseless dimensional reduction).
28. Relations between Quantum Mechanics and Probabilistic Language Models

• Quantum states resemble the states (strings, words, phrases) in probabilistic
  language models (HMMs, SCFGs), except that Ψ is a sum* of strings of electrons:

    Ψ(σ, π) = 0.1 |σ1 σ2 π1 π2| + 0.2 |σ2 σ3 π1 π2| + ...

• The energy matrix H is known exactly, but it is large. Models of H can be
  inferred from empirical data to simplify the computations.
• Energies ≈ log[probabilities], un-normalized.

*Not just a single string!
29. Dimensional Reduction in Quantum Chemistry:
where do semi-empirical Hamiltonians come from?

Ab initio (from first principles):
  Solve the entire H Ψ(σ,π) = E Ψ(σ,π) ... approximately.

OR

Semi-empirical:
  Assume the (σ, π) electrons are statistically independent:
    Ψ(σ,π) = p(π) q(σ)
  Treat the π-electrons explicitly, ignore σ (hidden):
    PHP p(π) = E p(π) : a much smaller problem
  Parameterize the PHP matrix => Heff with empirical data using a small set of
  molecules, then apply it to others (dyes, pigments).
30. Effective Hamiltonians: Semi-Empirical Pi-Electron Methods

  Heff[σ] p(π) = E p(π)

Block form:

  [ PHP  PHQ ] [p]       [p]
  [ QHP  QHQ ] [q]  =  E [q]

i.e.  PHP p + PHQ q = E p
      QHP p + QHQ q = E q    (q: implicit / hidden)

Formal solution: eliminate q =>

  Heff[E] = PHP + PHQ (E - QHQ)^(-1) QHP

The solution is formally exact => Dimensional Reduction / "Renormalization".
The final Heff can be solved iteratively (as with the eSelf Leff), or
perturbatively in various forms.
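A minimal numerical check that the partitioning is formally exact: for an eigenvalue E of the full (random, symmetric) H, Heff(E) on the small P block reproduces E; the block sizes are arbitrary:

```python
# Effective Hamiltonian by partitioning: Heff(E) = PHP + PHQ (E - QHQ)^-1 QHP.
import numpy as np

rng = np.random.default_rng(0)
n_p, n_q = 3, 5
H = rng.standard_normal((n_p + n_q, n_p + n_q))
H = (H + H.T) / 2                            # symmetric toy "Hamiltonian"

PHP = H[:n_p, :n_p]; PHQ = H[:n_p, n_p:]
QHP = H[n_p:, :n_p]; QHQ = H[n_p:, n_p:]

E = np.linalg.eigvalsh(H)[0]                 # an exact eigenvalue of the full H
Heff = PHP + PHQ @ np.linalg.solve(E * np.eye(n_q) - QHQ, QHP)

# Heff(E) acting on the small P block has E among its eigenvalues
w = np.linalg.eigvalsh(Heff)
print(E, w[np.argmin(np.abs(w - E))])        # the two values agree
```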
31. Graphical Methods

Decompose Heff into effective interactions Vij between electrons
(expand (E - QHQ)^(-1) in an infinite series, and remove the E dependence):

  [Diagram: Vij = a sum of interaction diagrams + ...]

Represent the expansion diagrammatically: ~300 diagrams to evaluate.
Precompile using symbolic manipulation: a ~35 MB executable; 8-10 hours to
compile; run time: 3-4 hours/parameter.
32. Effective Hamiltonians: Numerical Calculations

Example values for the parameter VCC (in eV):

  VCC:  π-only: 16    effective: 11.5    empirical: 11-12

Compute the "empirical" parameters ab initio:
  Can test all the basic assumptions of semi-empirical theory
  "from first principles".
  Also provides highly accurate eigenvalue spectra.
  Augment commercial packages (i.e. Fujitsu MOPAC) to model the spectroscopy
  of photoactive proteins.