Feature selection for Machine Learning applied to Document Ranking (aka L2R, LtR, LETOR). Contains empirical results on publicly available Yahoo! and Bing Web Search Engine data.
1. Feature Selection Algorithms for
Learning to Rank
Andrea Gigli
Slides: http://www.slideshare.net/andrgig
March-2016
2. Outline
Machine Learning for Ranking
Proposed Feature Selection Algorithms (FSA)
and Feature Selection Protocol
Application to Publicly Available Web Search Data
3. Outline
Machine Learning for Ranking
Proposed Feature Selection Algorithms (FSA)
and Feature Selection Protocol
Application to Publicly Available Web Search Data
4. Information Retrieval and Ranking
Systems
Information Retrieval is the activity of obtaining, from a collection of
information resources, the information offers relevant to an
information need.
Ranking consists of sorting the information offers
according to some criterion, so that the "best" results
appear early in the returned list.
5. Information Retrieval and Ranking
Systems
[Diagram: a Ranking System receives an Information Request (Query),
processes it against the Indexed Documents (the Information Offer),
and returns the (Top) Ranked Documents.]
6. How to Rank
Compute numeric scores on query/document pairs:
Cosine Similarity, BM25 score, LMIR probabilities, …
Use Machine Learning to build a ranking model:
Learning to Rank (L2R)
[Diagram: an Information Request (Query) and a set of Information
Offers (Documents A–H) are turned into a Ranked List.]
7. How to Rank using Supervised Learning
[Diagram: training queries with labelled documents feed a Learning
System; the learned scoring function is then used by the Ranking
System on the Indexed Documents at prediction time.]
Training: for each query q_i (i = 1, …, M), the associated documents
d_{i,1}, …, d_{i,N_i} and their observed scores ℓ_{i,1}, …, ℓ_{i,N_i}
are used to learn a scoring function f(q, d).
Prediction: for a new query q_{M+1}, the learned function is evaluated
on each candidate document, f(q_{M+1}, d).
Notation:
q_i: i-th query
d_{i,j}: j-th document associated with the i-th query
ℓ_{i,j}: observed score of the j-th document associated with the i-th query
f(q, d): scoring function
10. Outline
Machine Learning for Ranking
Proposed Feature Selection Algorithms (FSA)
and Feature Selection Protocol
Application to Publicly Available Web Search Data
11. Query & Information Offer Features
[Diagram: for query q_i, each associated document d_{i,j}
(j = 1, …, N_i) with label ℓ_{i,j} is represented by a feature vector
x_{i,j} = (x_{i,j}^(1), x_{i,j}^(2), …, x_{i,j}^(F)).]
The scoring function becomes a function of the features: f(q, o) → f(x).
F is of the order of hundreds or thousands.
12. Which features
Case: Web Search
Query-URL matching features: number of occurrences of query terms in
the document, BM25, N-gram BM25, Tf-Idf, …
Importance of URL: PageRank, number of in-links, number of clicks,
BrowseRank, spam score, page quality score, …
Case: Online Advertisement
User features: last page visited, time from the last visit, last
advertisement clicked, products queried, …
Product features: product description, product category, price, …
User-product matching features: tf-idf, expected rating, …
Page-product matching features: topic, category, tf-idf, …
Case: Collaborative Filtering
User features: age, gender, consumption history, …
Product characteristics: category, price, description, …
Context-product matching: tag matching, tf-idf, …
13. How to select features in L2R
The main goal of any feature selection process is to select a
subset of n elements from a set of N measurements, with
n < N, without significantly degrading the performance of
the system.
The search for the optimal subset requires searching among
2^N possible subsets.
14. How to select features in L2R
[Plot: the number of possible feature subsets (2^N) against the number
of features N; it already exceeds 8 million at N = 23.]
A suboptimal criterion is needed.
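The combinatorial explosion behind the plot is simply 2^N. A quick sketch in Python:

```python
def n_subsets(n_features):
    # every feature is either kept or dropped, so there are 2^N candidate subsets
    return 2 ** n_features

print(n_subsets(23))               # 8388608 subsets for only 23 features
print(len(str(n_subsets(519))))    # the 519-feature Yahoo! set: a 157-digit number
```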
15. Proposed Protocol for Comparing Feature Selection Algorithms
1. Measure the relevance of each feature
2. Measure the similarity of each pair of features
3. Select a feature subset using a feature selector
4. Train the L2R model
5. Measure the L2R model performance on the test set
6. Compare the feature selection algorithms
Repeat from step 3 for different subset sizes and for every feature
selection algorithm.
16. Competing Algorithms for feature selection
We developed the following algorithms:
Naïve Greedy search Algorithm for feature Selection (NGAS)
Naïve Greedy search Algorithm for feature Selection - Extended (NGAS-E)
Hierarchical Clustering search Algorithm for feature Selection (HCAS)
18. Competing Algorithm for feature
selection #1: NGAS
The undirected graph is built and the set S of selected features is
initialized.
19. Competing Algorithm for feature
selection #1: NGAS
Assuming Node 1 has the highest relevance, add it to S.
20. Competing Algorithm for feature
selection #1: NGAS
Select the node with the lowest similarity to Node
1, say Node 7, and the one with the highest
similarity to Node 7, say Node 5.
21. Competing Algorithm for feature
selection #1: NGAS
Remove Node 1. Node 5 has the higher relevance of Nodes 5 and 7, so
add it to S.
22. Competing Algorithm for feature
selection #1: NGAS
Select the node with the lowest similarity to Node
5, say Node 2, and the one with the highest
similarity to Node 2, say Node 3.
23. Competing Algorithm for feature
selection #1: NGAS
Remove Node 5. Assuming Node 2 has the higher relevance of Nodes 2
and 3, add it to S.
24. Competing Algorithm for feature
selection #1: NGAS
Select the node with the lowest similarity to Node
2, say Node 4, and the one with the highest
similarity to Node 4, say Node 8.
25. Competing Algorithm for feature
selection #1: NGAS
Remove Node 2. Assuming Node 4 has the higher relevance of Nodes 4
and 8, add it to S.
26. Competing Algorithm for feature
selection #1: NGAS
Select the node with the lowest similarity to Node
4, say Node 6, and the one with the highest
similarity to Node 6, say Node 7.
27. Competing Algorithm for feature
selection #1: NGAS
Remove Node 4. Assuming Node 6 has the higher relevance of Nodes 6
and 7, add it to S.
28. Competing Algorithm for feature
selection #1: NGAS
Select the node with the lowest similarity to Node
6, say Node 3, and the one with the highest
similarity to Node 3, say Node 8.
29. Competing Algorithm for feature
selection #1: NGAS
Remove Node 6. Assuming Node 3 has the higher relevance of Nodes 3
and 8, add it to S.
30. Competing Algorithm for feature
selection #1: NGAS
Select the node with the lowest similarity to Node
3, say Node 8, and the one with the highest
similarity to Node 8, say Node 7.
31. Competing Algorithm for feature
selection #1: NGAS
Remove Node 3. Assuming Node 8 has the higher relevance of Nodes 8
and 7, add it to S.
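The walkthrough above can be sketched in Python. The `relevance` and `similarity` inputs below are hypothetical precomputed scores (e.g. a per-feature relevance metric and pairwise similarity coefficients), not part of the original slides:

```python
def ngas(relevance, similarity, n_select):
    """Naive Greedy search Algorithm for feature Selection (sketch).

    relevance:  {feature: relevance score}
    similarity: {(i, j): similarity score} for each unordered pair of features
    """
    sim = lambda a, b: similarity[(a, b)] if (a, b) in similarity else similarity[(b, a)]
    remaining = set(relevance)
    cur = max(remaining, key=lambda v: relevance[v])   # most relevant node
    selected = [cur]
    while len(selected) < n_select and len(remaining) > 1:
        others = remaining - {cur}
        a = min(others, key=lambda v: sim(cur, v))     # least similar to the current node
        rest = others - {a}
        b = max(rest, key=lambda v: sim(a, v)) if rest else a  # most similar to a
        remaining.discard(cur)                         # remove the current node from the graph
        cur = a if relevance[a] >= relevance[b] else b # keep the more relevant of the pair
        selected.append(cur)
    return selected
```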
34. Competing Algorithm for feature
selection #2: NGAS-E (p=50%)
The undirected graph is built and the set S of selected features is
initialized.
35. Competing Algorithm for feature
selection #2: NGAS-E (p=50%)
Assuming Node 1 has the highest relevance, add it to S.
36. Competing Algorithm for feature
selection #2: NGAS-E (p=50%)
Select the ⌈7 × 50%⌉ nodes least similar to Node 1.
37. Competing Algorithm for feature
selection #2: NGAS-E (p=50%)
Remove Node 1 from the graph. Among the selected nodes, add the one
with the highest relevance (say Node 5) to S.
38. Competing Algorithm for feature
selection #2: NGAS-E (p=50%)
Select the ⌈6 × 50%⌉ nodes least similar to Node 5.
39. Competing Algorithm for feature
selection #2: NGAS-E (p=50%)
Remove Node 5 from the graph. Among the selected nodes, add the one
with the highest relevance (say Node 3) to S.
40. Competing Algorithm for feature
selection #2: NGAS-E (p=50%)
Select the ⌈5 × 50%⌉ nodes least similar to Node 3.
41. Competing Algorithm for feature
selection #2: NGAS-E (p=50%)
Remove Node 3 from the graph. Among the selected nodes, add the one
with the highest relevance (say Node 4) to S.
42. Competing Algorithm for feature
selection #2: NGAS-E (p=50%)
Select the ⌈4 × 50%⌉ nodes least similar to Node 4.
43. Competing Algorithm for feature
selection #2: NGAS-E (p=50%)
Remove Node 4 from the graph. Among the selected nodes, add the one
with the highest relevance (say Node 6) to S.
44. Competing Algorithm for feature
selection #2: NGAS-E (p=50%)
Select the ⌈3 × 50%⌉ nodes least similar to Node 6.
45. Competing Algorithm for feature
selection #2: NGAS-E (p=50%)
Remove Node 6 from the graph. Among the selected nodes, add the one
with the highest relevance (say Node 2) to S.
46. Competing Algorithm for feature
selection #2: NGAS-E (p=50%)
Select the ⌈2 × 50%⌉ nodes least similar to Node 2.
47. Competing Algorithm for feature
selection #2: NGAS-E (p=50%)
Node 2 is removed from the graph and Node 8 is added to S.
48. Competing Algorithm for feature
selection #2: NGAS-E (p=50%)
Node 8 is removed from the graph and the last node, Node 7, is added
to S.
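The NGAS-E walkthrough can be sketched the same way; again `relevance` and `similarity` are hypothetical precomputed inputs:

```python
import math

def ngas_e(relevance, similarity, p=0.5):
    """NGAS-Extended (sketch): at each step keep the ceil(m*p) nodes
    least similar to the current node as candidates, then add the most
    relevant candidate."""
    sim = lambda a, b: similarity[(a, b)] if (a, b) in similarity else similarity[(b, a)]
    remaining = set(relevance)
    cur = max(remaining, key=lambda v: relevance[v])      # most relevant node
    selected = [cur]
    while len(remaining) > 1:
        # sort the other nodes by similarity to the current node (ascending)
        others = sorted(remaining - {cur}, key=lambda v: sim(cur, v))
        candidates = others[:math.ceil(len(others) * p)]  # the least similar fraction p
        remaining.discard(cur)                            # remove current node from the graph
        cur = max(candidates, key=lambda v: relevance[v])
        selected.append(cur)
    return selected
```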
52. Outline
Machine Learning for Ranking
Proposed Feature Selection Algorithms (FSA)
and Feature Selection Protocol
Application to Publicly Available Web Search Data
53. Application to Web Search Engine Data
Bing Data: http://research.microsoft.com/en-us/projects/mslr/
Yahoo! Data: http://webscope.sandbox.yahoo.com

Yahoo! Data    Train     Validation   Test
#queries       19,944    2,994        6,983
#urls          473,134   71,083       165,660
#features      519

Bing Data      Train     Validation   Test
#queries       18,919    6,306        6,306
#urls          723,412   235,259      241,521
#features      136
58. Feature Relevance
The relevance of a document is measured with a categorical variable
(0, 1, 2, 3, 4), so we need metrics good at measuring "dependence"
between discrete/continuous feature variables and a categorical label
variable.
In the following we use:
Normalized Mutual Information (NMI)
Spearman's coefficient (S)
Kendall's tau (K)
Average Group Variance (AGV)
One-Variable NDCG@10 (1VNDCG)
59. Feature Relevance via Normalized Mutual Information
Mutual Information (MI) measures how much, on average, the
realization of a random variable X tells us about the realization
of the random variable Y, i.e. how much the entropy of Y, H(Y), is
reduced by knowing the realization of X:
MI(X, Y) = H(X) − H(X|Y) = H(X) + H(Y) − H(X, Y)
The normalized version is
NMI(X, Y) = MI(X, Y) / √(H(X) H(Y))
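A minimal sketch of NMI for discrete variables (a continuous feature would need binning first); the square-root normalization follows the formula above:

```python
import math
from collections import Counter

def entropy(values):
    # Shannon entropy in bits of the empirical distribution
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def nmi(xs, ys):
    hx, hy = entropy(xs), entropy(ys)
    mi = hx + hy - entropy(list(zip(xs, ys)))   # MI(X,Y) = H(X) + H(Y) - H(X,Y)
    return mi / math.sqrt(hx * hy) if hx > 0 and hy > 0 else 0.0
```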
60. Feature Relevance via Spearman's coefficient
Spearman's rank correlation coefficient is a non-parametric
measure of statistical dependence between two random variables.
It is given by
ρ = 1 − (6 Σ_i d_i²) / (n(n² − 1))
where n is the sample size and
d_i = rank(x_i) − rank(y_i)
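A sketch of the formula above, with mid-rank handling added for ties (with many ties the closed form is only an approximation):

```python
def ranks(xs):
    # 1-based ranks; tied values get the mean of the ranks they span
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # sum of squared rank differences
    return 1 - 6 * d2 / (n * (n * n - 1))
```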
61. Feature Relevance via Kendall's tau
Kendall's tau is a measure of association defined on two
ranking lists of length n. It is defined as
τ = (n_c − n_d) / √[ (n(n−1)/2 − n_1) (n(n−1)/2 − n_2) ]
where n_c denotes the number of concordant pairs between the
two lists, n_d denotes the number of discordant pairs,
n_1 = Σ_i t_i(t_i − 1)/2, n_2 = Σ_j u_j(u_j − 1)/2, t_i is the number
of tied values in the i-th group of ties for the first list and u_j
is the number of tied values in the j-th group of ties for the second
list.
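The tau-b formula above in plain Python, counting over all pairs (O(n²), fine for a sketch):

```python
import math

def kendall_tau_b(xs, ys):
    n = len(xs)
    nc = nd = n1 = n2 = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = xs[i] - xs[j], ys[i] - ys[j]
            if dx == 0:
                n1 += 1          # pair tied in the first list
            if dy == 0:
                n2 += 1          # pair tied in the second list
            if dx * dy > 0:
                nc += 1          # concordant pair
            elif dx * dy < 0:
                nd += 1          # discordant pair
    n0 = n * (n - 1) / 2
    return (nc - nd) / math.sqrt((n0 - n1) * (n0 - n2))
```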
62. Feature Relevance via Average Group Variance
Average Group Variance measures the discrimination power of a
feature. The intuitive justification is that a feature is useful if it
is capable of discriminating a small portion of the ordered scale
from the rest, and that features with a small within-group variance
are those which satisfy this property.
AGV = 1 − [ Σ_{g=1}^{5} n_g (x̄_g − x̄)² ] / [ Σ_i (x_i − x̄)² ]
where n_g is the size of group g, x̄_g is the sample mean of feature x
in the g-th group and x̄ is the whole-sample mean.
63. Feature Relevance via single feature LambdaMART scoring
For each feature i we run LambdaMART on that feature alone and
compute NDCG_{i,q}@10 for each query q.
The i-th feature relevance is measured by averaging NDCG_{i,q}@10
over the whole query set Q:
NDCG_i@10 = (1/|Q|) Σ_{q∈Q} NDCG_{i,q}@10
64. How to Measure Ranking Performance on query i
Precision at k:
P_i@k = (# relevant documents in top k results) / k
Average precision:
AP_i = (1 / # relevant documents) Σ_{k=1}^{|D|} P_i@k · 1[document k is relevant]
Discounted Cumulative Gain:
DCG_i = Σ_{j=1}^{k} (2^{rel_{i,j}} − 1) / log₂(1 + rank_j)
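The three metrics as minimal Python functions (binary `rels` for P@k and AP, graded `grades` for DCG):

```python
import math

def precision_at_k(rels, k):
    # rels: binary relevance of the ranked results, best-ranked first
    return sum(rels[:k]) / k

def average_precision(rels):
    n_rel = sum(rels)
    if n_rel == 0:
        return 0.0
    return sum(precision_at_k(rels, k)
               for k in range(1, len(rels) + 1) if rels[k - 1]) / n_rel

def dcg(grades, k=None):
    # grades: graded relevance rel_{i,j} of the ranked results
    k = len(grades) if k is None else k
    return sum((2 ** g - 1) / math.log2(1 + rank)
               for rank, g in enumerate(grades[:k], start=1))
```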
65. How to Measure Ranking Performance: Normalized DCG
Gain for each relevance rating:
Relevance Rating   Gain
Perfect            2^5 − 1 = 31
Excellent          2^4 − 1 = 15
Good               2^3 − 1 = 7
Fair               2^2 − 1 = 3
Bad                2^1 − 1 = 1

Ranked list:
Document     Gain   Cumulative Gain   Discounted Cumulative Gain
Document 1   31     31                31×1 = 31
Document 2   3      34                31 + 3×0.63 = 32.9
Document 3   7      41                32.9 + 7×0.5 = 36.4
Document 4   31     72                36.4 + 31×0.4 = 48.8

Ideal ordering:
Document     Gain   Cumulative Gain   Discounted Cumulative Gain
Document 1   31     31                31×1 = 31
Document 4   31     62                31 + 31×0.63 = 50.53
Document 3   7      69                50.53 + 7×0.5 = 54.03
Document 2   3      72                54.03 + 3×0.4 = 55.23

Normalization: divide DCG by the ideal DCG.
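The slide's example, recomputed with exact log₂ discounts instead of the rounded factors (0.63, 0.5, 0.4):

```python
import math

def dcg_from_gains(gains):
    # discount each gain by 1 / log2(1 + rank)
    return sum(g / math.log2(1 + rank) for rank, g in enumerate(gains, start=1))

gains = [31, 3, 7, 31]                 # ranked list from the slide
ideal = sorted(gains, reverse=True)    # ideal ordering: [31, 31, 7, 3]
ndcg = dcg_from_gains(gains) / dcg_from_gains(ideal)
print(round(ndcg, 3))                  # 0.899
```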
71. Feature Similarity
We used Spearman's rank coefficient for measuring feature similarity.
Spearman's rank coefficient is faster to compute than NMI,
Kendall's tau and 1VNDCG.
72. The FSA benchmark: Greedy Algorithm for feature Selection
1. Build a complete undirected graph G_0, in which
a) each node represents the i-th feature, with weight w_i, and
b) each edge has weight e_{i,j}
2. Let S_0 = ∅ be the set of selected features at step 0.
3. For i = 1, …, n:
a) Select the node with the largest weight from G_{i−1}; suppose it is
the k-th node
b) Punish all the nodes connected to the k-th node:
w_j ← w_j − 2·c·e_{k,j}, j ≠ k
c) Add the k-th node to S_{i−1}
d) Remove the k-th node from G_{i−1}
4. Return S_n
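The benchmark can be sketched directly from the four steps above; `relevance` and `similarity` are hypothetical precomputed inputs:

```python
def gas(relevance, similarity, n, c=1.0):
    """Greedy Algorithm for feature Selection (benchmark sketch)."""
    w = dict(relevance)                                  # node weights w_i
    sim = lambda a, b: similarity.get((a, b), similarity.get((b, a), 0.0))
    remaining = set(w)
    selected = []
    for _ in range(n):
        k = max(remaining, key=lambda v: w[v])           # largest-weight node
        remaining.discard(k)                             # remove it from the graph
        selected.append(k)                               # add it to S
        for j in remaining:
            w[j] -= 2 * c * sim(k, j)                    # punish nodes similar to k
    return selected
```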
79. Conclusions
We designed three FSAs and applied them to the Web search page
ranking problem.
NGAS-E and HCAS perform equal to or better than the benchmark model.
HCAS and NGAS are very
The proposed FSAs can be implemented independently of the L2R model.
The proposed FSAs can be applied to other ML contexts, to sorting
problems and to model ensembling.