Evaluation in Information Retrieval
1. Evaluation in Information Retrieval
(Book chapter from C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval)
Dishant Ailawadi
INF384H / CS395T: Concepts of Information Retrieval (and Web Search), Fall 2011
5. Standard Test Collections
● Cranfield: collected in the UK in the late 1950s. Too small to be used nowadays.
● TREC (Text REtrieval Conference):
– Each early TREC provided 50 information needs; TRECs 6–8 together provide 150 information needs over more than 500,000 newswire articles.
– The GOV2 collection of about 25 million web pages is now available for research.
● NTCIR: East Asian language and cross-language IR test collections.
● Cross-Language Evaluation Forum (CLEF): European languages and cross-language IR.
● Reuters-21578: the collection most used for text classification.
6. Evaluation Measures
                 Relevant                Non-relevant
Retrieved        true positives (tp)     false positives (fp)
Not retrieved    false negatives (fn)    true negatives (tn)
recall = (number of relevant documents retrieved) / (total number of relevant documents) = tp / (tp + fn)
precision = (number of relevant documents retrieved) / (total number of documents retrieved) = tp / (tp + fp)
accuracy = (tp + tn) / (tp + fp + fn + tn)   (how many correct selections?)
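A minimal sketch in Python of the three measures above, computed from contingency-table counts; the function names and the example counts are hypothetical, not taken from the slides.

# A minimal sketch of the three measures above, computed from contingency-table
# counts. The counts below are hypothetical illustrative values.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

tp, fp, fn, tn = 40, 10, 20, 930   # hypothetical counts
print(precision(tp, fp))           # 0.8
print(recall(tp, fn))              # ~0.667
print(accuracy(tp, fp, fn, tn))    # 0.97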
7. An Example
Let the total # of relevant docs = 6. Check precision at each new recall point:

n    doc #   relevant   recall; precision at each new relevant doc
1    588     x          R = 1/6 = 0.167; P = 1/1 = 1.0
2    589     x          R = 2/6 = 0.333; P = 2/2 = 1.0
3    576
4    590     x          R = 3/6 = 0.5;   P = 3/4 = 0.75
5    986
6    592     x          R = 4/6 = 0.667; P = 4/6 = 0.667
7    984
8    988
9    578
10   985
11   103
12   591
13   772     x          R = 5/6 = 0.833; P = 5/13 = 0.38
14   990

One relevant document is missing from the ranking, so it never reaches 100% recall.
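The recall/precision points in the table can be reproduced with a short sketch: the list of relevance marks mirrors the 'x' column above, and total_relevant = 6 is the assumption stated on the slide.

# Sketch reproducing the recall/precision points in the table above.
# relevant[i] is True when the document at rank i+1 is marked 'x'.
relevant = [True, True, False, True, False, True, False, False,
            False, False, False, False, True, False]
total_relevant = 6   # one relevant document is never retrieved

hits = 0
for rank, rel in enumerate(relevant, start=1):
    if rel:
        hits += 1
        print(f"rank {rank}: R = {hits / total_relevant:.3f}, P = {hits / rank:.3f}")
# rank 1: R = 0.167, P = 1.000
# rank 2: R = 0.333, P = 1.000
# rank 4: R = 0.500, P = 0.750
# rank 6: R = 0.667, P = 0.667
# rank 13: R = 0.833, P = 0.385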
8. Combining Precision & Recall
F-measure: weighted harmonic mean of precision and recall.
The value of β controls the tradeoff:
● β = 1: weight precision and recall equally.
● β > 1: weight recall more.
● β < 1: weight precision more.
F = 2PR / (P + R) = 2 / (1/R + 1/P)   (balanced F, β = 1)
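The general weighted form is F = (β² + 1)PR / (β²P + R), which reduces to the balanced formula above when β = 1. A small sketch, using hypothetical precision and recall values:

def f_measure(p, r, beta=1.0):
    # General weighted F: (beta^2 + 1) * P * R / (beta^2 * P + R);
    # beta = 1 gives the balanced form shown above.
    if p == 0 and r == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * p * r / (b2 * p + r)

p, r = 0.75, 0.5                  # hypothetical precision and recall
print(f_measure(p, r))            # 0.6   (balanced harmonic mean)
print(f_measure(p, r, beta=2.0))  # ~0.54 (pulled toward recall)
print(f_measure(p, r, beta=0.5))  # ~0.68 (pulled toward precision)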
12. Assessing Relevance
● Pooling: to obtain the subset of the collection to be judged for a query:
– Use a set of search engines/algorithms.
– The top k results (k is between 20 and 50 in TREC) are merged into a pool and duplicates are removed.
– Present the documents in a random order to analysts for relevance judgments.
● Kappa statistic: if we have multiple judges on one information need, how consistent are those judges?
kappa = (P(A) – P(E)) / (1 – P(E))
– P(A) is the proportion of the time that the judges agreed.
– P(E) is the proportion of the time they would be expected to agree by chance.
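A small sketch of the kappa computation for two judges; the boolean judgment lists are made up, and P(E) is estimated from the pooled marginals, one common way to model chance agreement.

# Sketch of the kappa statistic for two judges over the same pool of documents.
# The judgment lists are hypothetical (True = judged relevant).
def kappa(judge_a, judge_b):
    n = len(judge_a)
    p_agree = sum(a == b for a, b in zip(judge_a, judge_b)) / n   # P(A)
    p_rel = (sum(judge_a) + sum(judge_b)) / (2 * n)               # pooled P(relevant)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2                      # P(E)
    return (p_agree - p_chance) / (1 - p_chance)

a = [True, True, False, True, False, True, True, False, False, True]
b = [True, False, False, True, False, True, True, False, True, True]
print(kappa(a, b))   # ~0.58: agreement above chance (1 = perfect, 0 = chance level)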
14. Evaluation
R-precision:
R = # of relevant docs = 7
R-precision = 4/7 = 0.571

n    doc #   relevant
1    588     x
2    589     x
3    576
4    590     x
5    986
6    592     x
7    984
8    988
9    578
10   985
11   103
12   591
13   772     x
14   990

A/B test: precisely one change between the current and previous system; we evaluate the effect of that change on the system.
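A one-line sketch of the R-precision value above: with R = 7 known relevant documents for the query, count how many of the top R retrieved documents are relevant. The relevance list mirrors the 'x' marks in the table.

# Sketch of the R-precision computation above.
relevant = [True, True, False, True, False, True, False, False,
            False, False, False, False, True, False]   # the 'x' marks above
R = 7
print(sum(relevant[:R]) / R)   # 4/7 ≈ 0.571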