Chapter 3 Retrieval Evaluation
Hsin-Hsi Chen
Department of Computer Science and Information Engineering
National Taiwan University

Hsin-Hsi Chen

3-1
Evaluation
• Function analysis
• Time and space
  – The shorter the response time, the smaller the space used, the better the system is
• Performance evaluation (for data retrieval)
  – Performance of the indexing structure
  – The interaction with the operating systems
  – The delays in communication channels
  – The overheads introduced by software layers
• Performance evaluation (for information retrieval)
  – Besides time and space, retrieval performance is an issue
Hsin-Hsi Chen

3-2
Retrieval Performance Evaluation
• Retrieval task
  – Batch mode
    • The user submits a query and receives an answer back
    • How the answer set is generated
  – Interactive mode
    • The user specifies his information need through a series of interactive steps with the system
    • Aspects
      – User effort
      – characteristics of interface design
      – guidance provided by the system
      – duration of the session

Hsin-Hsi Chen

3-3
Recall and Precision
• Recall = |Ra| / |R|
  – the fraction of the relevant documents which has been retrieved
• Precision = |Ra| / |A|
  – the fraction of the retrieved documents which is relevant

  [Diagram: within the collection, the relevant documents |R| and the answer set |A| overlap in |Ra|, the relevant documents in the answer set]

Hsin-Hsi Chen
3-4
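A minimal Python sketch (not part of the original slides; the document ids and variable names are illustrative) of how the two ratios are computed for one query:

def recall_precision(relevant, answer_set):
    """Return (recall, precision) for one query.
    relevant:   set of ids of all relevant documents, |R|
    answer_set: set of ids of the retrieved documents, |A|"""
    ra = len(relevant & answer_set)   # |Ra|: relevant documents that were retrieved
    recall = ra / len(relevant) if relevant else 0.0
    precision = ra / len(answer_set) if answer_set else 0.0
    return recall, precision

# 10 relevant documents, 15 retrieved, 5 of the retrieved are relevant
R = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
A = {"d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187",
     "d25", "d38", "d48", "d250", "d113", "d3"}
print(recall_precision(R, A))   # (0.5, 0.333...)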
precision versus recall curve
• The user is not usually presented with all the documents in the answer set A at once
• Example
  Rq={d3,d5,d9,d25,d39,d44,d56,d71,d89,d123}
  Ranking for query q by a retrieval algorithm (• marks a relevant document):
   1. d123 •    6. d9 •     11. d38
   2. d84       7. d511     12. d48
   3. d56 •     8. d129     13. d250
   4. d6        9. d187     14. d113
   5. d8       10. d25 •    15. d3 •
  (precision, recall) points as the user walks down the ranking:
  (100%,10%), (66%,20%), (50%,30%), (40%,40%), (33%,50%)

Hsin-Hsi Chen

3-5
11 standard recall levels for a query
• precision versus recall based on 11 standard recall levels: 0%, 10%, 20%, …, 100%
  [Figure: interpolated precision (%) versus recall (%) curve for a single query]

Hsin-Hsi Chen
3-6
11 standard recall levels for several queries
• average the precision figures at each recall level

  P(r) = Σ_{i=1}^{Nq} Pi(r) / Nq

• P(r): the average precision at the recall level r
• Nq: the number of queries used
• Pi(r): the precision at recall level r for the i-th query
Hsin-Hsi Chen

3-7
necessity of interpolation procedure
• Rq={d3,d56,d129}
   1. d123      6. d9        11. d38
   2. d84       7. d511      12. d48
   3. d56 •     8. d129 •    13. d250
   4. d6        9. d187      14. d113
   5. d8       10. d25       15. d3 •
  (precision, recall) points: (33.3%,33.3%) at d56, (25%,66.6%) at d129, (20%,100%) at d3

How about the precision figures at the recall levels 0, 0.1, 0.2, 0.3, …, 1?

Hsin-Hsi Chen

3-8
interpolation procedure
• rj (j ∈ {0,1,2,…,10}): a reference to the j-th standard recall level (e.g., r5 references the recall level 50%)
• P(rj) = max rj≤r≤rj+1 P(r)   (interpolated precision)
• Example (relevant documents seen at d56 (33.3%,33.3%), d129 (25%,66.6%), d3 (20%,100%)):
  r0: (33.33%,0%)    r1: (33.33%,10%)   r2: (33.33%,20%)
  r3: (33.33%,30%)   r4: (25%,40%)      r5: (25%,50%)
  r6: (25%,60%)      r7: (20%,70%)      r8: (20%,80%)
  r9: (20%,90%)      r10: (20%,100%)

Hsin-Hsi Chen

3-9
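The following Python sketch (not from the slides) implements the procedure just described: it records a (recall, precision) point at each relevant document in the ranking and then interpolates each standard recall level with the maximum precision observed at any recall greater than or equal to that level, which reproduces the example values above. Averaging the resulting 11 values over all queries gives the P(r) curve of the previous slide.

def eleven_point_interpolated(ranking, relevant):
    """11-point interpolated precision for one query.
    ranking:  list of document ids in ranked order
    relevant: set of relevant document ids"""
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))   # (recall, precision)
    levels = [j / 10 for j in range(11)]
    return [max((p for r, p in points if r >= lvl - 1e-9), default=0.0)
            for lvl in levels]

ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
print(eleven_point_interpolated(ranking, {"d3", "d56", "d129"}))
# [0.333, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.2, 0.2, 0.2, 0.2]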
Precision versus recall figures
compare the retrieval performance of distinct retrieval algorithms over a set of example queries
• The curve of precision versus recall which results from averaging the results for various queries
  [Figure: averaged precision (%) versus recall (%) curve]

Hsin-Hsi Chen

3-10
Average Precision at given
Document Cutoff Values
• Compute the average precision when 5, 10,
15, 20, 30, 50 or 100 relevant documents
have been seen.
• Provide additional information on the
retrieval performance of the ranking
algorithm

Hsin-Hsi Chen

3-11
Single Value Summaries
compare the retrieval performance of a retrieval algorithm for individual queries
• Average precision at seen relevant documents
  – Generate a single value summary of the ranking by averaging the precision figures obtained after each new relevant document is observed
  – Example
     1. d123 • (1)      6. d9 •  (0.5)    11. d38
     2. d84             7. d511           12. d48
     3. d56 • (0.66)    8. d129           13. d250
     4. d6              9. d187           14. d113
     5. d8             10. d25 • (0.4)    15. d3 • (0.33)
    (1+0.66+0.5+0.4+0.33)/5 = 0.57
  – Favors systems which retrieve relevant documents quickly

Hsin-Hsi Chen
3-12
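A small Python sketch (not from the slides) of this single-value summary; as in the example, the average is taken over the relevant documents that actually appear in the ranking:

def average_precision(ranking, relevant):
    """Average the precision observed at each retrieved relevant document."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
print(average_precision(ranking, {"d123", "d56", "d9", "d25", "d3"}))
# 0.58 (the slide rounds the intermediate figures and reports 0.57)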
Single Value Summaries
(Continued)

• Reciprocal Rank (RR)
  – Equal to the precision at the first retrieved relevant document
  – Useful for tasks that need only one relevant document, e.g., question answering

• Mean Reciprocal Rank (MRR)
– The mean of RR over several queries
Hsin-Hsi Chen

3-13
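A corresponding sketch (illustrative data, not from the slides) for RR and MRR:

def reciprocal_rank(ranking, relevant):
    """1/rank of the first retrieved relevant document, 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (ranking, relevant_set) pairs, one pair per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

# first relevant answer at rank 1 for query 1 and at rank 3 for query 2
runs = [(["d1", "d2", "d3"], {"d1"}),
        (["d9", "d8", "d7"], {"d7"})]
print(mean_reciprocal_rank(runs))   # (1.0 + 1/3) / 2 = 0.666...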
Single Value Summaries
(Continued)

• R-Precision
– Generate a single value summary of ranking by
computing the precision at the R-th position in the
ranking, where R is the total number of relevant
documents for the current query
  Example 1:
   1. d123 •    6. d9 •
   2. d84       7. d511
   3. d56 •     8. d129
   4. d6        9. d187
   5. d8       10. d25 •
  R=10 and # relevant=4
  R-precision = 4/10 = 0.4

  Example 2:
   1. d123
   2. d84
   3. d56 •
  R=3 and # relevant=1
  R-precision = 1/3 = 0.33

Hsin-Hsi Chen

3-14
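The same two examples expressed as a short Python sketch (the relevant sets are illustrative; the second call pads the example with hypothetical ids so that R=3):

def r_precision(ranking, relevant):
    """Precision at rank R, where R is the total number of relevant documents."""
    R = len(relevant)
    if R == 0:
        return 0.0
    return sum(1 for doc in ranking[:R] if doc in relevant) / R

rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25"]
print(r_precision(ranking, rq))                                   # R=10, 4 relevant -> 0.4
print(r_precision(["d123", "d84", "d56"], {"d56", "dX", "dY"}))   # R=3, 1 relevant -> 0.33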
Single Value Summaries
(Continued)

• Precision Histograms
  – An R-precision graph for several queries
  – Compare the retrieval history of two algorithms
    RP_A/B(i) = RP_A(i) − RP_B(i)
    where RP_A(i) and RP_B(i) are the R-precision values of retrieval algorithms A and B for the i-th query
  – RP_A/B(i) = 0: both algorithms have equivalent performance for the i-th query
  – RP_A/B(i) > 0: A has better retrieval performance for query i
  – RP_A/B(i) < 0: B has better retrieval performance for query i
Hsin-Hsi Chen

3-15
Single Value Summaries (Continued)
  [Figure: precision histogram — the RP_A/B(i) values (y-axis, −1.5 to 1.5) for query numbers i = 1 to 10]

Hsin-Hsi Chen

3-16
Summary Table Statistics
• Statistical summary regarding the set of all the
queries in a retrieval task
– the number of queries used in the task
– the total number of documents retrieved by all queries
– the total number of relevant documents which were
effectively retrieved when all queries are considered
– the total number of relevant documents which could
have been retrieved by all queries
– …
Hsin-Hsi Chen

3-17
Precision and Recall
Appropriateness
• Estimation of maximal recall requires knowledge
of all the documents in the collection
• Recall and precision capture different aspects of
the set of retrieved documents
• Recall and precision measure the effectiveness
over queries in batch mode
• Recall and precision are defined under the
enforcement of linear ordering of the retrieved
documents
Hsin-Hsi Chen

3-18
The Harmonic Mean
• harmonic mean F(j) of recall and precision

  F(j) = 2 / ( 1/R(j) + 1/P(j) )

• R(j): the recall for the j-th document in the ranking
• P(j): the precision for the j-th document in the ranking
• Equivalently, F = 2 × P × R / (P + R)
Hsin-Hsi Chen

3-19
Example
   1. d123      6. d9        11. d38
   2. d84       7. d511      12. d48
   3. d56 •     8. d129 •    13. d250
   4. d6        9. d187      14. d113
   5. d8       10. d25       15. d3 •
  (33.3%,33.3%)  (25%,66.6%)  (20%,100%)

  F(3)  = 2 / (1/0.33 + 1/0.33) = 0.33
  F(8)  = 2 / (1/0.25 + 1/0.67) = 0.36
  F(15) = 2 / (1/0.20 + 1/1)    = 0.33

Hsin-Hsi Chen

3-20
The E Measure
• E evaluation measure
  – Allows the user to specify whether he is more interested in recall or precision

  E(j) = 1 − (1 + b²) / ( b²/R(j) + 1/P(j) )

  The corresponding F measure: F = (β² + 1) × P × R / (β² × P + R)

Hsin-Hsi Chen

3-21
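A small sketch (not from the slides) of the harmonic mean and its weighted relatives; with beta = b = 1 the F value is the plain harmonic mean of the previous slides, and E is simply one minus the corresponding F value:

def f_measure(precision, recall, beta=1.0):
    """F_beta = (beta^2 + 1) P R / (beta^2 P + R); beta = 1 gives the harmonic mean."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

def e_measure(precision, recall, b=1.0):
    """E(j) = 1 - F_b(j); larger b emphasizes recall, smaller b emphasizes precision."""
    return 1.0 - f_measure(precision, recall, beta=b)

# the F(j) values of the example on slide 3-20
for p, r in [(0.333, 0.333), (0.25, 0.666), (0.20, 1.0)]:
    print(round(f_measure(p, r), 2))   # 0.33, 0.36, 0.33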
User-oriented measures
• Basic assumption of previous evaluation
  – The set of relevant documents for a query is the same, independent of the user
• User-oriented measures
  – coverage ratio
  – novelty ratio
  – relative recall
  – recall effort

Hsin-Hsi Chen

3-22
coverage = |Rk| / |U|
  high coverage ratio: the system finds most of the relevant documents the user expected to see

novelty = |Ru| / ( |Ru| + |Rk| )
  high novelty ratio: the system reveals many new relevant documents which were previously unknown

relative recall = ( |Rk| + |Ru| ) / |U|

recall effort = (# of relevant docs the user expected to find) / (# of docs examined to find the expected relevant docs)

  [Diagram: within the answer set |A| proposed by the system and the relevant docs |R| in the collection, |U| is the set of relevant docs known to the user, |Rk| the relevant docs known to the user which were retrieved, and |Ru| the relevant docs previously unknown to the user which were retrieved]

Hsin-Hsi Chen

3-23
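A short sketch (illustrative numbers, not from the slides) of the first three user-oriented measures in terms of the quantities |Rk|, |Ru| and |U| defined above:

def user_oriented_measures(rk, ru, u):
    """rk = |Rk| (known relevant docs retrieved), ru = |Ru| (previously unknown
    relevant docs retrieved), u = |U| (relevant docs known to the user)."""
    coverage = rk / u if u else 0.0
    novelty = ru / (ru + rk) if (ru + rk) else 0.0
    relative_recall = (rk + ru) / u if u else 0.0
    return coverage, novelty, relative_recall

# the user knew 20 relevant documents; the system retrieved 15 of them plus 10 new ones
print(user_oriented_measures(15, 10, 20))   # (0.75, 0.4, 1.25)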
A More Modern Relevance
Metric for Web Search
• Normalized Discounted Cumulated Gain (NDCG)
– K. Jaervelin and J. Kekaelaeinen (TOIS 2002)
– Gain: the relevance of a document is no longer binary
– Sensitive to the position of the highest rated documents
• Log-discounting of gains according to the positions

– Normalize the DCG with the “ideal set” DCG.

Hsin-Hsi Chen

3-24
NDCG Example
• Assume that the relevance scores 0 – 3 are used.
G’=<3, 2, 3, 0, 0, 1, 2, 2, 3, 0, …>

• Cumulated Gain (CG)

  CG[i] = G[1],               if i = 1
          CG[i−1] + G[i],     otherwise

  CG’ = <3, 5, 8, 8, 8, 9, 11, 13, 16, 16, …>

Hsin-Hsi Chen

3-25
NDCG Example
(Continued)

• Discounted Cumulated Gain (DCG)

  DCG[i] = G[1],                          if i = 1
           DCG[i−1] + G[i] / log_b i,     otherwise

  let b=2,
  DCG’ = <3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61, …>

• Normalized Discounted Cumulated Gain (NDCG)
  Ideal vector I’ = <3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, …>
  CGI’  = <3, 6, 9, 11, 13, 15, 16, 17, 18, 19, 19, 19, 19, …>
  DCGI’ = <3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 11.21, 11.53, 11.83, 11.83, …>
  NDCG’ = <1, 0.83, 0.89, 0.73, 0.62, 0.6, 0.69, 0.76, 0.89, 0.84, …>

Hsin-Hsi Chen

3-26
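The two NDCG slides can be condensed into the following Python sketch (not from the slides): dcg() reproduces the DCG’ vector above, and ndcg() divides it position by position by the DCG of the ideal ordering.

import math

def dcg(gains, b=2):
    """Cumulated gain, discounted by log_b(rank) for ranks >= b."""
    total, out = 0.0, []
    for i, g in enumerate(gains, start=1):
        total += g if i < b else g / math.log(i, b)
        out.append(total)
    return out

def ndcg(gains, ideal_gains, b=2):
    """NDCG[i] = DCG[i] / DCG_ideal[i] at every rank i."""
    return [d / di for d, di in zip(dcg(gains, b), dcg(ideal_gains, b))]

G = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
I = [3, 3, 3, 2, 2, 2, 1, 1, 1, 1]   # ideal reordering of the graded judgments
print([round(x, 2) for x in dcg(G)])   # [3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61]
print([round(x, 2) for x in ndcg(G, I)])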
Test Collections
• Components
  – Document set (document collection)
  – Queries (topics)
  – Relevance judgments
• Uses
  – Design and development: system testing
  – Evaluation: measuring system effectiveness
  – Comparison: comparing different systems and different techniques
• Benchmarking
  – Different evaluation items for different purposes
  – Quantitative measures such as precision and recall

Hsin-Hsi Chen

3-27
Test Collections (Continued)
• Small test collections
  – Early: Cranfield
  – English: SMART collections, OHSUMED, Cystic Fibrosis, LISA, …
  – Japanese: BMIR-J2
• Large-scale evaluation campaigns: provide test collections and a forum for discussion
  – United States: TREC
  – Japan: NTCIR, IREX
  – Europe: AMARYLLIS, CLEF
Hsin-Hsi Chen

3-28
Basic data of the test collections

Collection        #Documents   Subject domain                               Language
Cranfield II      1,400        aerodynamics                                 English
ADI               82           documentation                                English
MEDLARS           1,033        medicine                                     English
TIME              423          world affairs                                English
CACM              3,204        (ACM) communications                         English
CISI              1,460        information science                          English
NPL               11,429       electronics, computers, physics, geography   English
INSPEC            12,684       physics, electronics, control                English
UKCIS             27,361       biochemistry                                 English
UKAEA             12,765       nuclear science                              English
LISA              6,004        library & information science                English
Cystic Fibrosis   1,239        medicine                                     English
ISILT             N/A          N/A                                          English
OHSUMED           348,566      medicine                                     English
BMIR-J2           5,080        economics, engineering                       Japanese
TREC (TREC-1~6)   1,754,896    multiple topics (~5 GB)                      English
AMARYLLIS         336,000      multiple topics                              French
NTCIR             300,000      multiple topics                              Japanese
IREX              N/A          multiple topics                              Japanese

[The original table also listed, for each collection, the collection size in MB, the average number of words per document, the number of topics, the average number of words per topic, the average number of relevant documents per topic, and the number of relevance-judgment levels.]

Early test collections:
(1) short bibliographic records such as titles, abstracts and keywords
(2) specialized subject domains

Recent test collections:
(1) full text on many topics, with detailed topics
(2) large scale

Hsin-Hsi Chen
3-29
Cranfield II
(ftp://ftp.cs.cornell.edu/pub/smart/cran/)

• Compared the retrieval effectiveness of 33 different indexing methods
• Collected 1,400 documents (in abstract form) on aerodynamics; each author was asked to pose questions based on these documents and on the topic he was researching at the time, and after screening more than 200 queries were produced
.I 001
.W
what similarity laws must be obeyed when constructing
aeroelastic models of heated high speed aircraft?
Hsin-Hsi Chen
3-30
Cranfield II (Continued)
• The relevance judgments of the Cranfield II collection were built in four steps
  – First, the person who posed each query judged the relevance of the citations and references attached to the documents.
  – Next, five graduate students in the field checked each query against every document one by one, spending 1,500 hours and making more than 500,000 relevance judgments, in the hope of finding all the relevant documents.
  – To catch anything still missed, bibliographic coupling was used to compute the similarity between documents and discover further possibly relevant documents. (Two or more documents that cite one or more papers in common are said to be bibliographically coupled.)
  – Finally, all the documents found above were sent back to the original author for judgment.

Hsin-Hsi Chen

3-31
TREC Overview
• TREC: Text REtrieval Conference
• Sponsors: NIST and DARPA; one of the subprojects of the TIPSTER text program
• Leader: Donna Harman (Manager of The Natural Language Processing and Information Retrieval Group of the Information Access and User Interfaces Division, NIST)
• Document collection
  – more than 5 GB
  – millions of documents

Hsin-Hsi Chen

3-32
History
• TREC-1 (Text Retrieval Conference) Nov 1992
• TREC-2 Aug 1993
• TREC-3
• TREC-7
  – January 16, 1998 -- submit application to NIST.
  – Beginning February 2 -- document disks distributed to those new participants who have returned the required forms.
  – June 1 -- 50 new test topics for ad hoc task distributed
  – August 3 -- ad hoc results due at NIST
  – September 1 -- latest track submission deadline.
  – September 4 -- speaker proposals due at NIST.
  – October 1 -- relevance judgments and individual evaluation scores due back to participants
  – Nov. 9-11 -- TREC-7 conference at NIST in Gaithersburg, Md.
• TREC-8 (1999), TREC-9 (2000), TREC-10 (2001), …
Hsin-Hsi Chen
3-33
The Test Collection
• the documents
• the example information requests (called
topics in TREC)
• the relevant judgments (right answers)

Hsin-Hsi Chen

3-34
The Documents
• Disk 1 (1 GB)
  – WSJ: Wall Street Journal (1987, 1988, 1989)
  – AP: AP Newswire (1989)
  – ZIFF: Articles from Computer Select disks (Ziff-Davis Publishing)
  – FR: Federal Register (1989)
  – DOE: Short abstracts from DOE publications
• Disk 2 (1 GB)
  – WSJ: Wall Street Journal (1990, 1991, 1992)
  – AP: AP Newswire (1988)
  – ZIFF: Articles from Computer Select disks
  – FR: Federal Register (1988)

Hsin-Hsi Chen

3-35
The Documents (Continued)
• Disk 3 (1 GB)
  – SJMN: San Jose Mercury News (1991)
  – AP: AP Newswire (1990)
  – ZIFF: Articles from Computer Select disks
  – PAT: U.S. Patents (1993)
• Statistics
  – document lengths
    DOE (very short documents) vs. FR (very long documents)
  – range of document lengths
    AP (similar in length) vs. WSJ and ZIFF (wider range of lengths)

Hsin-Hsi Chen

3-36
TREC document collections

  Note: DOE (very short documents) vs. FR (very long documents);
        AP (similar in length) vs. WSJ and ZIFF (wider range of lengths)

Volume (Revised)      Sources                                          Size (MB)   Docs      Median #Terms/Doc   Mean #Terms/Doc
1 (March 1994)        Wall Street Journal, 1987-1989                   267         98,732    245                 434.0
                      Associated Press newswire, 1989                  254         84,678    446                 473.9
                      Computer Selects articles, Ziff-Davis            242         75,180    200                 473.0
                      Federal Register, 1989                           260         25,960    391                 1315.9
                      Abstracts of U.S. DOE publications               184         226,087   111                 120.4
2 (March 1994)        Wall Street Journal, 1990-1992 (WSJ)             242         74,520    301                 508.4
                      Associated Press newswire, 1988 (AP)             237         79,919    438                 468.7
                      Computer Selects articles, Ziff-Davis (ZIFF)     175         56,920    182                 451.9
                      Federal Register, 1988 (FR88)                    209         19,860    396                 1378.1
3 (March 1994)        San Jose Mercury News, 1991                      287         90,257    379                 453.0
                      Associated Press newswire, 1990                  237         78,321    451                 478.4
                      Computer Selects articles, Ziff-Davis            345         161,021   122                 295.4
                      U.S. patents, 1993                               243         6,711     4445                5391.0
4 (May 1996)          The Financial Times, 1991-1994 (FT)              564         210,158   316                 412.7
                      Federal Register, 1994 (FR94)                    395         55,630    588                 644.7
                      Congressional Record, 1993 (CR)                  235         27,922    288                 1373.5
5 (April 1997)        Foreign Broadcast Information Service (FBIS)     470         130,471   322                 543.6
                      Los Angeles Times (1989, 1990)                   475         131,896   351                 526.5
Routing Test Data     Foreign Broadcast Information Service (FBIS)     490         120,653   348                 581.3

Hsin-Hsi Chen

3-37
Document Format
(in Standard Generalized Mark-up Language, SGML)
<DOC>
<DOCNO>WSJ880406-0090</DOCNO>
<HL>AT&T Unveils Services to Upgrade Phone Networks Under Global Plan </HL>
<AUTHOR>Janet Guyon (WSJ staff) </AUTHOR>
<DATELINE>New York</DATELINE>
<TEXT>
American Telephone & Telegraph Co. introduced the first of a new generation of
phone services with broad implications for computer and communications
.
.
</TEXT>
</DOC>

Hsin-Hsi Chen

3-38
TREC Document Markup (example)
<DOC>
<DOCN0>FT911-3</DOCN0>
<PROFILE>AN-BE0A7AAIFT</PROFILE>
<DATE>910514
</DATE>
<HEADLINE>
FT 14 MAY 91 / International Company News: Contigas plans DM900m east German project
</HEADLINE>
<BYLINE>
By DAVID GOODHART
</BYLINE>
<DATELINE>
BONN
</DATELINE>
<TEXT>
CONTIGAS, the German gas group 81 per cent owned by the utility Bayernwerk, said yesterday that it intends to
invest DM900m (Dollars 522m) in the next jour years to build a new gas distribution system in the east German state of
Thuringia. …
</TEXT>

</DOC>

Hsin-Hsi Chen

3-39
The Topics
• Issue 1
– allow a wide range of query construction methods
– keep the topic (user need) distinct from the query (the
actual text submitted to the system)

• Issue 2
– increase the amount of information available about
each topic
– include with each topic a clear statement of what
criteria make a document relevant

• TREC
– 50 topics/year, 400 topics (TREC1~TREC7)
Hsin-Hsi Chen

3-40
Sample Topics used in TREC-1 and TREC-2
<top>
<head>Tipster Topic Description
<num>Number: 066
<dom>Domain: Science and Technology
<title>Topic: Natural Language Processing
<desc>Description: (one sentence description)
Document will identify a type of natural language processing technology which
is being developed or marketed in the U.S.
<narr>Narrative: (complete description of document relevance for assessors)
A relevant document will identify a company or institution developing or
marketing a natural language processing technology, identify the technology,
and identify one or more features of the company’s product.
<con>Concepts: (a mini-knowledge base about the topic, such as a real searcher might possess)
1. natural language processing
2. translation, language, dictionary, font
3. software applications

Hsin-Hsi Chen

3-41
<fac> Factor(s): (allow easier automatic query building by listing specific items from the narrative that constrain the documents that are relevant)
<nat> Nationality: U.S.
</fac>
<def>Definition(s):
</top>

Hsin-Hsi Chen

3-42
TREC-1 and TREC-2 Topic (example)
<top>
<head> Tipster Topic Description
<num> Number: 037
<dom> Domain: Science and Technology
<title> Topic: Identify SAA components
<desc> Description:
Document identifies software products which adhere to IBM's SAA standards.
<narr> Narrative:
To be relevant, a document must identify a piece of software which is considered a Systems Application Architectural
(SAA) component or one which conforms to SAA.
<con> Concept(s):
1. SAA
2. OfficeVision
3. IBM
4. Standards, Interfaces, Compatibility
<fac> Factor(s):
<def> Definition(s):
OfficeVision - A series of integrated office automation applications from IBM that runs across all of its major computer
families.
Systems Application Architecture (SAA) - A set of IBM standards that provide consistent user interfaces, programming
interfaces, and communications protocols among all IBM computers from micro to mainframe.

</top>

Hsin-Hsi Chen

3-43
TREC-3 Topic (example)

<top>
<num> Number: 177
<title> Topic: English as the Official Language in U.S.
<desc> Description:
Document will provide arguments supporting the making of English the standard language of the U.S.
<narr> Narrative:
A relevant document will note instances in which English is favored as a standard language. Examples are the
positive results achieved by immigrants in the areas of acceptance, greater economic opportunity, and increased
academic achievement. Reports are also desired which describe some of the language difficulties encountered by
other nations and groups of nations, e.g., Canada, Belgium, European Community, when they have opted for the use of
two or more languages as their official means of communication. Not relevant are reports which promote
bilingualism or multilingualism.
</top>

Hsin-Hsi Chen

3-44
Sample Topics used in TREC-3
<num>Number: 168
<title>Topic: Financing AMTRAK
<desc>Description:
A document will address the role of the Federal Government in financing
the operation of the National Railroad Transportation Corporation (AMTRAK)
<narr>Narrative: A relevant document must provide information on the
government’s responsibility to make AMTRAK an economically viable entity.
It could also discuss the privatization of AMTRAK as an alternative to
continuing government subsidies. Documents comparing government subsidies
given to air and bus transportation with those provided to AMTRAK would also
be relevant.

Hsin-Hsi Chen

3-45
Features of topics in TREC-3
• The topics are shorter.
• The topics miss the complex structure of the earlier topics.
• The concept field has been removed.
• The topics were written by the same group of users that did the assessments.
• Summary:
  – TREC-1 and 2 (topics 1-150): suited to the routing task
  – TREC-3 (topics 151-200): suited to the ad-hoc task

Hsin-Hsi Chen

3-46
TREC-4 Topic (example)
<top>
<num> Number: 217
<desc> Description:
Reporting on possibility of and search for extra-terrestrial life/intelligence.
</top>

TREC-4 kept only the description field; TREC-5 adjusted the topic structure back to one similar to TREC-3, but with a shorter average length.

Hsin-Hsi Chen

3-47
TREC Topics
• Topic construction
• Topic screening
  – pre-search
  – judge the number of relevant documents
• Topic structure and length (word counts include stop words)

                   Field         min words   max words   average words
TREC-1 (51-100)    Title              1          11            3.8
                   Description        5          41           17.9
                   Narrative         23         209           64.5
                   Concepts           4         111           21.2
                   Total             44         250          107.4
TREC-2 (101-150)   Title              2           9            4.9
                   Description        6          41           18.7
                   Narrative         27         165           78.8
                   Concepts           3          88           28.5
                   Total             54         231          130.8
TREC-3 (151-200)   Title              2          20            6.5
                   Description        9          42           22.3
                   Narrative         26         146           74.6
                   Total             49         180          103.4
TREC-4 (201-250)   Description        8          33           16.3
                   Total              8          33           16.3
TREC-5 (251-300)   Title              2          10            3.8
                   Description        6          40           15.7
                   Narrative         19         168           63.2
                   Total             29         213           82.7
TREC-6 (301-350)   Title              1           5            2.7
                   Description        5          62           20.4
                   Narrative         17         142           65.3
                   Total             47         156           88.4

Hsin-Hsi Chen
3-48
TREC-6 Topic Screening Procedure
• Enter keywords into the PRISE system and run a retrieval.
• How many of the top 25 documents are relevant?
  – 0: the topic is not adopted
  – 1-5: continue reading retrieved documents 26-100 and judge their relevance
  – 6-20: using relevance feedback and similar means, enter more queries, run the retrieval again, and judge the relevance of the top 100 documents
  – ≧ 20: the topic is not adopted
• Record the number of relevant documents.

Hsin-Hsi Chen

3-49
The Relevance Judgments
• For each topic, compile a list of relevant documents.
• approaches
– full relevance judgments (impossible)
judge over 1M documents for each topic, result in 100M judgments
– random sample of documents (insufficient relevance sample)
relevance judgments done on the random sample only
– TREC approach (pooling method)
make relevance judgments on the sample of documents selected by
various participating systems
assumption: the vast majority of relevant documents have been found and
that documents that have not been judged can be assumed to be not
relevant

• pooling method
– Take the top 100 documents retrieved by each system for a given topic.
– Merge them into a pool for relevance assessment.
– The sample is given to human assessors for relevance judgments.

Hsin-Hsi Chen

3-50
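A minimal sketch of the pooling step (hypothetical run data; not an official TREC tool): the top documents of every submitted run for a topic are merged, duplicates disappear because a set is used, and the resulting pool is what the human assessor judges.

def build_pool(runs, depth=100):
    """runs: list of ranked document-id lists, one per participating system.
    Returns the de-duplicated pool formed from the top `depth` of each run."""
    pool = set()
    for ranking in runs:
        pool.update(ranking[:depth])
    return pool

run_a = ["d7", "d2", "d9", "d4"]
run_b = ["d2", "d5", "d7", "d1"]
print(sorted(build_pool([run_a, run_b], depth=3)))   # ['d2', 'd5', 'd7', 'd9']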
TREC Relevance Judgments
• Judgment method
  – pooling method
  – manual judgment
• Judgment criterion: binary, relevant vs. not relevant
• Quality of the relevance judgments
  – completeness
  – consistency
Hsin-Hsi Chen

3-51
The Pooling Method
• For each topic, the top n documents are taken from the results submitted by each participating system and merged into a pool.
• The pool is regarded as the candidate set of possibly relevant documents for that topic; after duplicates are removed, it is sent back to the original constructor of the topic for relevance judgment.
• The spirit of this method is to use many different systems and retrieval techniques to net as many of the possibly relevant documents as possible, thereby reducing the burden of manual judgment.
Hsin-Hsi Chen

3-52
Overlap of Submitted Results
TREC-1 (TREC-2): top 100 documents for each run (33 runs & 40 runs)
TREC-3: top 100 (200) documents for each run (48 runs)
After pooling, each topic was judged by a single assessor to ensure the best consistency of judgment.
Although TREC-1 and TREC-2 differ by 7 runs, the proportions of unique documents retrieved (39% vs. 28%) do not differ much, and neither do the proportions judged relevant (22% vs. 19%).
TREC-3 took twice as many documents per run for judgment, yet the unique portions differ little (21% vs. 20%), and the proportions judged relevant also differ little (15% vs. 10%).    3-53
Hsin-Hsi Chen
TREC: pool candidate sets versus actually relevant documents

Routing
           Total docs submitted      Docs in pool            Actually relevant
           to the pool by all runs   (duplicates removed)    documents
TREC-1     8800                      1279 (39%)              277 (22%)
TREC-2     4000                      1106 (28%)              210 (19%)
TREC-3     2700                      1005 (37%)              146 (15%)
TREC-4     7300                      1711 (24%)              130 (08%)
TREC-5     10100                     2671 (27%)              110 (04%)
TREC-6     8480                      1445 (42%)               92 (6.4%)

Adhoc
           Total docs submitted      Docs in pool            Actually relevant
           to the pool by all runs   (duplicates removed)    documents
TREC-1     2200                      1067 (49%)              371 (35%)
TREC-2     4000                      1466 (37%)              210 (14%)
TREC-3     2300                       703 (31%)              146 (21%)
TREC-4     3800                       957 (25%)              132 (14%)
TREC-5     3100                       955 (31%)              113 (12%)
TREC-6     4400                      1306 (30%)              140 (11%)

Hsin-Hsi Chen

3-54
TREC relevance judgment records
  columns: topic number, iteration, document id (document collection + number), relevance decision
54 0 FB6-F004-0059 0      54 0 FB6-F004-0087 1      54 0 FB6-F004-0096 1
54 0 FB6-F004-0073 1      54 0 FB6-F004-0089 1      54 0 FB6-F004-0098 1
54 0 FB6-F004-0077 1      54 0 FB6-F004-0090 1      54 0 FB6-F004-0100 1
54 0 FB6-F004-0078 1      54 0 FB6-F004-0092 1      54 0 FB6-F004-0102 1
54 0 FB6-F004-0080 1      54 0 FB6-F004-0094 1      54 0 FB6-F004-0104 1
54 0 FB6-F004-0083 1      54 0 FB6-F004-0095 1      54 0 FB6-F004-0105 1

Hsin-Hsi Chen

3-55
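Records in this format are easy to read back for scoring; the sketch below (the file name is hypothetical) builds a per-topic dictionary of judgments:

from collections import defaultdict

def load_qrels(path):
    """Parse 'topic iteration doc-id judgment' lines into {topic: {doc-id: judgment}}."""
    qrels = defaultdict(dict)
    with open(path) as f:
        for line in f:
            topic, _iteration, doc_id, judgment = line.split()
            qrels[topic][doc_id] = int(judgment)
    return qrels

# qrels = load_qrels("qrels.txt")                        # hypothetical file name
# relevant = {d for d, j in qrels["54"].items() if j > 0}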
TREC Evaluation
Tasks and tracks across TREC-1 to TREC-7:
• Main tasks: Routing, Adhoc
• Tracks: Confusion, Spoken Document Retrieval, Database Merging, Filtering, High Precision, Interactive, Cross Language (Multilingual, Spanish, Chinese), Natural Language Processing, Query, Very Large Corpus

Hsin-Hsi Chen

3-56
TREC-7
• Ad hoc task
– Participants will receive 5 gigabytes of data for use in training
their systems.
– The 350 topics used in the first six TREC workshops and the
relevance judgments for those topics will also be available.
– The 50 new test topics (351-400) will be distributed in June and
will be used to search the document collection consisting of the
documents on TREC disks 4 and 5.
– Results will be submitted to NIST as the ranked top 1000
documents retrieved for each topic.

Hsin-Hsi Chen

3-57
TREC-7 (Continued)
• Track tasks
– Filtering Track
• A task in which the topics are stable (and some relevant
documents are known) but there is a stream of new documents.
• For each document, the system must make a binary decision as
to whether the document should be retrieved (as opposed to
forming a ranked list).

– Cross-Language Track
• An ad hoc task in which some documents are in English, some
in German, and others in French.
• The focus of the track will be to retrieve documents that
pertain to the topic regardless of language.

Hsin-Hsi Chen

3-58
TREC-7 (Continued)
• High Precision User Track
– An ad hoc task in which participants are given five minutes per
topic to produce a retrieved set using any means desired (e.g.,
through user interaction, completely automatically).

• Interactive Track
– A task used to study user interaction with text retrieval systems.

• Query Track
– A track designed to foster research on the effects of query
variability and analysis on retrieval performance.
– Participants each construct several different versions of existing
TREC topics, some versions as natural language topics and some
as structured queries in a common format.
– All groups then run all versions of the topics.

Hsin-Hsi Chen

3-59
TREC-7 (Continued)
• Spoken Document Retrieval Track
– An ad hoc task that investigates a retrieval system's ability to
retrieve spoken document (recordings of speech).

• Very Large Corpus (VLC)
– An ad hoc task that investigates the ability of retrieval systems to
handle larger amounts of data. The current target corpus size is
approximately 100 gigabytes.

Hsin-Hsi Chen

3-60
Categories of Query Construction
• AUTOMATIC
completely automatic initial query construction

• MANUAL
manual initial construction

• INTERACTIVE
use of interactive techniques to construct the queries

Hsin-Hsi Chen

3-61
Levels of Participation
• Category A: full participation
• Category B:
full participation using a reduced database
• Category C: evaluation only
• submit up to two runs for routing task, the adhoc
task, or both
• send in the top 1000 documents retrieved for each
topic for evaluation
Hsin-Hsi Chen

3-62
TREC-3 Participants
(14 companies, 19 universities)

Hsin-Hsi Chen

3-63
TREC-6
Apple Computer
AT&T Labs Research
Australian National Univ.
Carnegie Mellon Univ.
CEA (France)
Center for Inf. Res., Russia
Duke Univ./Univ. of Colorado/Bellcore
ETH (Switzerland)
FS Consulting, Inc.
GE Corp./Rutgers Univ.
George Mason Univ./NCR Corp
Harris Corp.
IBM T.J. Watson Res. (2 groups)
ISS (Singapore)
ITI (Singapore)
APL, Johns Hopkins Univ.
……………

Hsin-Hsi Chen

3-64
Evaluation Measures at TREC
• Summary table statistics
– The number of topics used in the task
– The number of documents retrieved over all topics
– The number of relevant documents which were
effectively retrieved for all topics

• Recall-precision averages
• Document level averages
– Average precision at specified document cutoff values
(e.g., 5, 10, 20, 100 relevant documents)

• Average precision histogram
Hsin-Hsi Chen

3-65
TREC: Criticisms and Negative Assessments
• Test collections
  – Topics
    • not real user needs; overly artificial
    • lack a description of the need situation
  – Relevance judgments
    • binary relevance judgments are not realistic
    • the pooling method misses relevant documents, so recall figures are inaccurate
    • quality and consistency
• Effectiveness measurement
  – attends only to quantitative measures
  – problems with recall
  – suitable for comparisons between systems, but not for evaluation

Hsin-Hsi Chen

3-66
TREC: Criticisms and Negative Assessments (Continued)
• Evaluation procedure
  – Interactive retrieval
    • lacks user involvement
    • a static information need is unrealistic

Hsin-Hsi Chen

3-67
NTCIR Overview
• NTCIR: NACSIS Test Collections for IR
• Organizer: NACSIS (the Japanese National Center for Science Information Systems)
• Background
  – need for a large-scale Japanese benchmark test collection
  – research and development needs of cross-language retrieval
• Document collection
  – source: the NACSIS Academic Conference Papers Database
  – mainly abstracts of conference papers
  – more than 330,000 documents, of which more than half are Japanese-English paired documents
  – some documents include part-of-speech tags

Hsin-Hsi Chen
3-68
NTCIR Topics
• Source: real user needs are collected, then revised and rewritten
• 100 topics are available, belonging to different academic disciplines
• Structure
  <TOPIC q=nnnn> topic number
  <title> title </title>
  <description> brief description of the information need </description>
  <narrative> detailed description of the information need, including further explanation, definitions of terms, background knowledge, the purpose of the search, the expected number of relevant documents, the desired document types, the criteria for relevance judgment, etc. </narrative>
  <concepts> keywords of related concepts </concepts>

Hsin-Hsi Chen

3-69
NTCIR Relevance Judgments
• Judgment method
  – the pooling method is used for initial screening
  – judgments are made by experts on each topic and by the constructors of the topics
• Judgment levels
  – A: relevant
  – B: partially relevant
  – C: not relevant
• Precision computation: differs according to the test item
  – Relevant qrels: both B and C are treated as not relevant
  – Partially relevant qrels: both A and B are treated as relevant
Hsin-Hsi Chen
3-70
NTCIR Evaluation
• Ad-hoc Information Retrieval Task
• Cross-lingual Information Retrieval Task
  – Japanese topics are used to retrieve English documents
  – 21 topics; the relevance judgments cover both English and Japanese documents
  – systems may construct queries automatically or manually
  – systems must return the top 1000 retrieved documents
• Automatic Term Extraction and Role Analysis Task
  – Automatic Term Extraction: extract technical terms from titles and abstracts
  – Role Analysis Task
Hsin-Hsi Chen
3-71
NTCIR Workshop 2
• organizers
– Hsin-Hsi Chen (Chinese IR track)
– Noriko Kando (Japanese IR track)
– Sung-Hyon Myaeng (Korean IR track)

• Chinese test collection
– developer: Professor Kuang-hua Chen (LIS,
NTU)
– Document collection: 132,173 news stories
– Topics: 50
Hsin-Hsi Chen

3-72
NTCIR 2 schedule
• Someday in April, 2000: Call for Participation
• May or later: Training set will be distributed
• August, 2000: Test Documents and Topics will be
distributed.
• Sept.10-30, 2000: Results submission
• Jan., 2001: Evaluation results will be distributed.
• Feb. 1, 2001: Paper submission for working notes
• Feb. 19-22, 2001 (or Feb. 26-March 1): Workshop
(in Tokyo)
• March, 2001: Proceedings
Hsin-Hsi Chen

3-73
IREX Overview
• IREX: Information Retrieval and Extraction Exercise
• Organizer: Information Processing Society of Japan
• Participants: about 20 teams (or more)
• Preliminary test: used the topics of the BMIR-J2 test collection
• Document collection
  – Mainichi Shimbun, 1994-1995
  – participants must purchase the news corpus

Hsin-Hsi Chen

3-74
IREX Topics
• Structure
  <topic_id> topic number </topic_id>
  <description> brief information need, mainly a noun phrase formed from nouns and their modifiers </description>
  <narrative> detailed information need, stated in natural language, usually consisting of 2 to 3 sentences, possibly including explanations of terms, synonyms or examples </narrative>
  – the words in the description field must be contained in the narrative field
Hsin-Hsi Chen

3-75
IREX Relevance Judgments
• Basis for judgment: all fields of the test topic
• Judgment method: two students make the judgments
  – if their judgments agree, the relevance judgment is final
  – if their judgments disagree or are uncertain, a third person makes the final decision
• Judgment levels
  – students: 6 levels
    • A: relevant              A?: uncertain whether relevant
    • B: partially relevant    B?: uncertain whether partially relevant
    • C: not relevant          C?: uncertain whether not relevant

Hsin-Hsi Chen

3-76
IREX Relevance Judgments (Continued)
  – final judges: 3 levels
    • A: relevant
    • B: partially relevant
    • C: not relevant
• Revision of the relevance judgments

Hsin-Hsi Chen

3-77
IREX Evaluation
• Evaluation items
  – Named Entity Task (NE)
    • similar to MUC; tests a system's ability to automatically extract proper names such as organization names, person names, and place names
    • extraction from general-domain documents vs. extraction from specialized-domain documents
  – Information Retrieval (IR)
    • similar to TREC
• Evaluation rules
  – documents returned: top 300
Hsin-Hsi Chen

3-78
BMIR-J2 Overview
• The first test collection for Japanese information retrieval systems
  – BMIR-J1: 1996
  – BMIR-J2: March 1998
• Developed by: IPSG-SIGDS
• Document collection: mainly news articles
  – Mainichi Shimbun: 5,080 articles
  – economics and engineering
• Topics: 60
Hsin-Hsi Chen

3-79
BMIR-J2 Relevance Judgments
• Keywords combined with Boolean logic are used to search with one or two IR systems
• The database searchers make further relevance judgments
• The staff who built the test collection check the judgments once more
Hsin-Hsi Chen

3-80
BMIR-J2 Topics
Q: F=oxoxo: "Utilizing solar energy"
Q: N-1: Retrieve texts mentioning use of solar energy
Q: N-2: Include texts concerning generating electricity and drying things with solar heat.

• Classification of the topics
  – Purpose: mark the characteristics of each test topic so that systems can be selected accordingly
  – Marks: o (necessary), x (unnecessary)
  – Categories
    • The basic function
    • The numeric range function
    • The syntactic function
    • The semantic function
    • The world knowledge function

Hsin-Hsi Chen

3-81
AMARYLLIS Overview
• Organizer: INIST (Institut de l'Information Scientifique et Technique)
• Participants: close to 10 teams
• Document collection
  – news articles: Le Monde, more than 20,000 articles
  – titles and abstracts of documents extracted from Pascal (1984-1995) and Francis (1992-1995), more than 300,000 items

Hsin-Hsi Chen

3-82
AMARYLLIS Topics
• Structure
  <num> topic number </num>
  <dom> subject domain </dom>
  <suj> title </suj>
  <que> brief description of the information need </que>
  <cinf> detailed description of the information need </cinf>
  <ccept><c> concepts, descriptors </c></ccept>

Hsin-Hsi Chen

3-83
AMARYLLIS Relevance Judgments
• Original relevance judgments
  – built by the owners of the document collections
• Revision of the answer key
  – Added
    • documents that were not in the initial answer key but were retrieved by more than half of the participants
    • the top 10 documents of the results submitted by each participant
  – Removed
    • documents that appear in the original answer key but do not appear in any of the results submitted by the participants
Hsin-Hsi Chen
3-84
AMARYLLIS Evaluation
• Systems must return the top 250 retrieved documents
• Systems may construct queries automatically or manually
• Evaluation items
  – Routing Task
  – Adhoc Task

Hsin-Hsi Chen

3-85
An Evaluation of Query Processing Strategies
Using the Tipster Collection
(SIGIR 1993: 347-355)
James P. Callan and W. Bruce Croft

Hsin-Hsi Chen

3-86
INQUERY Information Retrieval System
• Documents are indexed by the word stems and numbers
that occur in the text.
• Documents are also indexed automatically by a small
number of features that provide a controlled indexing
vocabulary.
• When a document refers to a company by name, the
document is indexed by the company name and the feature
#company.
• INQUERY includes company, country, U.S. city, number
and date, and person name recognizer.
Hsin-Hsi Chen

3-87
INQUERY Information Retrieval System
• feature operators
#company operator matches the #company feature
• proximity operators
require their arguments to occur either in order, within
some distance of each other, or within some window
• belief operators
use the maximum, sum, or weighted sum of a set of beliefs
• synonym operators
• Boolean operators
Hsin-Hsi Chen

3-88
Query Transformation in INQUERY
•
•
•
•

Discard stop phrases.
Recognize phrases by stochastic part of speech tagger.
Look for word “not” in the query.
Recognize proper names by assuming that a sequence of
capitalized words is a proper name.
• Introduce synonyms by a small set of words that occur in
the Factors field of TIPSTER topics.
• Introduce controlled vocabulary terms (feature operators).

Hsin-Hsi Chen

3-89
Techniques for Creating Ad Hoc Queries
• Simple Queries (description-only approach)
– Use the contents of Description field of TIPSTER topics only.
– Explore how the system behaves with the very short queries.

• Multiple Sources of Information (multiple-field approach)
– Use the contents of the Description, Title, Narrative, Concept(s)
and Factor(s) fields.
– Explore how a system might behave with an elaborate user
interface or very sophisticated query processing

• Interactive Query Creation
– Automatic query creation followed by simple manual
modifications.
– Simulate simple user interaction with the query processing.

Hsin-Hsi Chen

3-90
Simple Queries
• A query is constructed automatically by employing all the
query processing transformations on Description field.
• The remaining words and operators are enclosed in a
weighted sum operator.
• 11-point average precision

Hsin-Hsi Chen

3-91
Hsin-Hsi Chen

3-92
Multiple Sources of Information
• Q-1 (all fields; +phrases, −synonym, −concept): Created automatically, using the T, D, N, C and F fields. Everything except the synonym and concept operators was discarded from the Narrative field. (baseline model)
• Q-3 (all fields; −phrases, −proper names): The same as Q-1, except that recognition of phrases and proper names was disabled. (words-only query)
  To determine whether phrase and proximity operators were helpful.
• Q-4 (+phrases on the Narrative field, −phrases on the other fields): The same as Q-1, except that recognition of phrases was applied to the Narrative field.
  To determine whether the simple query processing transformation would be effective on the abstract descriptions in the Narrative field.

Hsin-Hsi Chen

3-93
Multiple Sources of Information (Continued)
• Q-6 (−Description, −Narrative): The same as Q-1, except that only the T, C, and F fields were used.
  Narrow in on the set of fields that appeared most useful.
• Q-F (+thesaurus, +phrases): The same as Q-1, with 5 additional thesaurus words or phrases added automatically to each query.
  An approach to automatically discovering thesaurus terms.
• Q-7: A combination of Q-1 and Q-6.
  Whether combining the results of two relatively similar queries could yield an improvement.
  At first glance Q-6 seems to be just a part of Q-1 and there would be no need to combine them, but on closer inspection they are not the same: if terms are selected according to some criterion, Q-1 may take only a small part of the T, C and F fields, whereas Q-6 does not.

Hsin-Hsi Chen

3-94
A Comparison of Six Automatic Methods of Constructing Ad Hoc Queries
• Discarding the Description and Narrative fields did not hurt performance appreciably.
• Q-1 and Q-6, which are similar, retrieve different sets of documents.
• Phrases from the Narrative were not helpful.
• It is possible to automatically construct a useful thesaurus for a collection.
• Phrases improved performance at low recall.

Hsin-Hsi Chen

3-95
Interactive Query Creation
• The system created a query using method Q-1, and then a person was permitted to modify the resulting query.
• Modifications
  – add words from the Narrative field
  – delete words or phrases from the query
  – indicate that certain words or phrases should occur near each other within a document
• Q-M (+addition, +deletion)
  Manual addition of words or phrases from the Narrative, and manual deletion of words or phrases from the query
• Q-O (+addition, +deletion, +proximity)
  The same as Q-M, except that the user could also indicate that certain words or phrases must occur within 50 words of each other

Hsin-Hsi Chen

3-96
[Results figure annotations]
• Paragraph retrieval (within 50 words) significantly improves effectiveness.
• Recall levels of 10%-60% are acceptable because users are not likely to examine all documents retrieved.

Hsin-Hsi Chen

3-97
The effects of thesaurus terms and phrases on queries that were created automatically and modified manually
• Q-MF: thesaurus expansion before modification, so the added terms can be included in unordered window operators
• Q-OF: thesaurus expansion after modification; the thesaurus words and phrases were added after the query was modified, so they were not used in unordered window operators
  Cf. Q-O (42.7)

Hsin-Hsi Chen

3-98
Okapi at TREC3 and TREC4
SE Robertson, S Walker, S Jones, MM
Hancock-Beaulieu, M Gatford
Department of Information Science
City University
Hsin-Hsi Chen

3-99
sim(d_j, q) ≈ P(d_j|R) / P(d_j|R̄)
            ≈ Σ_{i=1}^{t} g_i(d_j) × g_i(q) × log [ P(k_i|R) × (1 − P(k_i|R̄)) / ( P(k_i|R̄) × (1 − P(k_i|R)) ) ]

P(k_i|R)  = (V_i + 0.5) / (V + 1)            1 − P(k_i|R)  = 1 − (V_i + 0.5)/(V + 1) = (V − V_i + 0.5) / (V + 1)
P(k_i|R̄) = (n_i − V_i + 0.5) / (N − V + 1)   1 − P(k_i|R̄) = 1 − (n_i − V_i + 0.5)/(N − V + 1) = (N − V − n_i + V_i + 0.5) / (N − V + 1)

sim(d_j, q) ≈ log [ ( (V_i + 0.5)/(V + 1) × (N − V − n_i + V_i + 0.5)/(N − V + 1) ) /
                    ( (n_i − V_i + 0.5)/(N − V + 1) × (V − V_i + 0.5)/(V + 1) ) ]
            = log [ (V_i + 0.5) × (N − V − n_i + V_i + 0.5) / ( (n_i − V_i + 0.5) × (V − V_i + 0.5) ) ]

3-100
Hsin-Hsi Chen
BM25 function in Okapi

  Σ_{T ∈ Q} w^(1) × ( (k1 + 1) tf / (K + tf) ) × ( (k3 + 1) qtf / (k3 + qtf) )  +  k2 × |Q| × (avdl − dl) / (avdl + dl)

Q: a query, containing terms T
w^(1): the Robertson-Sparck Jones weight
  w^(1) = log [ (r + 0.5) × (N − n − R + r + 0.5) / ( (n − r + 0.5) × (R − r + 0.5) ) ]
N: the number of documents in the collection (note: N)
n: the number of documents containing the term (note: n_i)
R: the number of documents known to be relevant to a specific topic (note: V)
r: the number of relevant documents containing the term (note: V_i)
K: k1 × ((1 − b) + b × dl / avdl)
k1, b, k2 and k3: parameters which depend on the database and the nature of the topics;
  in the TREC-4 experiments, k1, k3 and b were 1.0-2.0, 8 and 0.6-0.75, respectively, and k2 was zero throughout
tf: frequency of occurrence of the term within a specific document (note: k_i)
qtf: the frequency of the term within the topic from which Q was derived
dl: document length
avdl: average document length

Hsin-Hsi Chen
3-101
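A Python sketch of the BM25 weighting shown above (an illustration, not the Okapi implementation): when no relevance information is available, R and r are taken as 0, so w^(1) reduces to an idf-like weight; the parameter defaults follow the TREC-4 ranges quoted on the slide.

import math

def bm25_score(query_terms, doc_terms, n_docs, doc_freq, avdl,
               k1=1.2, b=0.75, k2=0.0, k3=8.0, R=0, rel_freq=None):
    """Score one document for one query using the slide's notation.
    query_terms: query terms (repeats allowed -> qtf)
    doc_terms:   document terms (repeats allowed -> tf); dl = len(doc_terms)
    n_docs:      N, number of documents in the collection
    doc_freq:    dict term -> n, number of documents containing the term
    avdl:        average document length
    rel_freq:    optional dict term -> r (relevant documents containing the term)"""
    dl = len(doc_terms)
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for term in set(query_terms):
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        qtf = query_terms.count(term)
        n = doc_freq.get(term, 0)
        r = (rel_freq or {}).get(term, 0)
        w1 = math.log((r + 0.5) * (n_docs - n - R + r + 0.5) /
                      ((n - r + 0.5) * (R - r + 0.5)))
        score += w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    # document-length correction term from the slide (k2 was zero in the TREC-4 runs)
    score += k2 * len(query_terms) * (avdl - dl) / (avdl + dl)
    return score

doc = "the cat sat on the mat".split()
print(bm25_score(["cat", "mat"], doc, n_docs=1000,
                 doc_freq={"cat": 40, "mat": 10}, avdl=8.0))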

Mais conteúdo relacionado

Destaque

Risparmio bolletta e costi telefonici per le aziende
Risparmio bolletta e costi telefonici per le aziendeRisparmio bolletta e costi telefonici per le aziende
Risparmio bolletta e costi telefonici per le aziendewww.profweb.it
 
Regency Capital Partners Marketing Jan 09
Regency Capital Partners Marketing Jan 09Regency Capital Partners Marketing Jan 09
Regency Capital Partners Marketing Jan 09Mark Hill
 
Analisi Traffico Telefonico , Gestione automatica bollette
Analisi Traffico Telefonico , Gestione automatica bollette  Analisi Traffico Telefonico , Gestione automatica bollette
Analisi Traffico Telefonico , Gestione automatica bollette www.profweb.it
 
C:\Fakepath\Health Eoc Review
C:\Fakepath\Health Eoc ReviewC:\Fakepath\Health Eoc Review
C:\Fakepath\Health Eoc ReviewLeo Hsu
 
H:\Facts\Role Models
H:\Facts\Role ModelsH:\Facts\Role Models
H:\Facts\Role Modelsguest321520a
 
Health Eoc Review 1
Health Eoc Review 1Health Eoc Review 1
Health Eoc Review 1Leo Hsu
 
Health Eoc Review
Health Eoc ReviewHealth Eoc Review
Health Eoc ReviewLeo Hsu
 

Destaque (7)

Risparmio bolletta e costi telefonici per le aziende
Risparmio bolletta e costi telefonici per le aziendeRisparmio bolletta e costi telefonici per le aziende
Risparmio bolletta e costi telefonici per le aziende
 
Regency Capital Partners Marketing Jan 09
Regency Capital Partners Marketing Jan 09Regency Capital Partners Marketing Jan 09
Regency Capital Partners Marketing Jan 09
 
Analisi Traffico Telefonico , Gestione automatica bollette
Analisi Traffico Telefonico , Gestione automatica bollette  Analisi Traffico Telefonico , Gestione automatica bollette
Analisi Traffico Telefonico , Gestione automatica bollette
 
C:\Fakepath\Health Eoc Review
C:\Fakepath\Health Eoc ReviewC:\Fakepath\Health Eoc Review
C:\Fakepath\Health Eoc Review
 
H:\Facts\Role Models
H:\Facts\Role ModelsH:\Facts\Role Models
H:\Facts\Role Models
 
Health Eoc Review 1
Health Eoc Review 1Health Eoc Review 1
Health Eoc Review 1
 
Health Eoc Review
Health Eoc ReviewHealth Eoc Review
Health Eoc Review
 

Semelhante a Chapter 3 retrieval evaluation

Ranking and Diversity in Recommendations - RecSys Stammtisch at SoundCloud, B...
Ranking and Diversity in Recommendations - RecSys Stammtisch at SoundCloud, B...Ranking and Diversity in Recommendations - RecSys Stammtisch at SoundCloud, B...
Ranking and Diversity in Recommendations - RecSys Stammtisch at SoundCloud, B...Alexandros Karatzoglou
 
learned optimizer.pptx
learned optimizer.pptxlearned optimizer.pptx
learned optimizer.pptxQingsong Guo
 
Data analysis with R
Data analysis with RData analysis with R
Data analysis with RShareThis
 
CS-102 DS-class_01_02 Lectures Data .pdf
CS-102 DS-class_01_02 Lectures Data .pdfCS-102 DS-class_01_02 Lectures Data .pdf
CS-102 DS-class_01_02 Lectures Data .pdfssuser034ce1
 
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화NAVER Engineering
 
C++ Notes PPT.ppt
C++ Notes PPT.pptC++ Notes PPT.ppt
C++ Notes PPT.pptAlpha474815
 
lecture1.ppt
lecture1.pptlecture1.ppt
lecture1.pptSagarDR5
 
DS Unit-1.pptx very easy to understand..
DS Unit-1.pptx very easy to understand..DS Unit-1.pptx very easy to understand..
DS Unit-1.pptx very easy to understand..KarthikeyaLanka1
 
Chapter 5 Query Evaluation.pdf
Chapter 5 Query Evaluation.pdfChapter 5 Query Evaluation.pdf
Chapter 5 Query Evaluation.pdfHabtamu100
 
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...MLconf
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningBig_Data_Ukraine
 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnDataRobot
 
Performance evaluation of IR models
Performance evaluation of IR modelsPerformance evaluation of IR models
Performance evaluation of IR modelsNisha Arankandath
 
Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...
Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...
Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...PyData
 
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...Julián Urbano
 
Design Patterns using Amazon DynamoDB
 Design Patterns using Amazon DynamoDB Design Patterns using Amazon DynamoDB
Design Patterns using Amazon DynamoDBAmazon Web Services
 

Semelhante a Chapter 3 retrieval evaluation (20)

Big datacourse
Big datacourseBig datacourse
Big datacourse
 
Ranking and Diversity in Recommendations - RecSys Stammtisch at SoundCloud, B...
Ranking and Diversity in Recommendations - RecSys Stammtisch at SoundCloud, B...Ranking and Diversity in Recommendations - RecSys Stammtisch at SoundCloud, B...
Ranking and Diversity in Recommendations - RecSys Stammtisch at SoundCloud, B...
 
learned optimizer.pptx
learned optimizer.pptxlearned optimizer.pptx
learned optimizer.pptx
 
Rtutorial
RtutorialRtutorial
Rtutorial
 
Data analysis with R
Data analysis with RData analysis with R
Data analysis with R
 
CS-102 DS-class_01_02 Lectures Data .pdf
CS-102 DS-class_01_02 Lectures Data .pdfCS-102 DS-class_01_02 Lectures Data .pdf
CS-102 DS-class_01_02 Lectures Data .pdf
 
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
 
C++ Notes PPT.ppt
C++ Notes PPT.pptC++ Notes PPT.ppt
C++ Notes PPT.ppt
 
lecture1.ppt
lecture1.pptlecture1.ppt
lecture1.ppt
 
DS Unit-1.pptx very easy to understand..
DS Unit-1.pptx very easy to understand..DS Unit-1.pptx very easy to understand..
DS Unit-1.pptx very easy to understand..
 
Chapter 5 Query Evaluation.pdf
Chapter 5 Query Evaluation.pdfChapter 5 Query Evaluation.pdf
Chapter 5 Query Evaluation.pdf
 
R Language Introduction
R Language IntroductionR Language Introduction
R Language Introduction
 
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learn
 
Performance evaluation of IR models
Performance evaluation of IR modelsPerformance evaluation of IR models
Performance evaluation of IR models
 
R
RR
R
 
Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...
Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...
Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...
 
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
 
Design Patterns using Amazon DynamoDB
 Design Patterns using Amazon DynamoDB Design Patterns using Amazon DynamoDB
Design Patterns using Amazon DynamoDB
 

Último

Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docxPoojaSen20
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptxPoojaSen20
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 

Último (20)

Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptx
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 

Chapter 3 retrieval evaluation

  • 11. Average Precision at given Document Cutoff Values • Compute the average precision when 5, 10, 15, 20, 30, 50 or 100 relevant documents have been seen. • Provide additional information on the retrieval performance of the ranking algorithm Hsin-Hsi Chen 3-11
  • 12. Single Value Summaries compare the retrieval performance of a retrieval algorithm for individual queries • Average precision at seen relevant documents – Generate a single value summary of the ranking by averaging the precision figures obtained after each new relevant document is observed – Example 1. d123 • (1) 6. d9 • (0.5) 11. d38 2. d84 7. d511 12. d48 3. d56 • (0.66) 8. d129 13. d250 4. d6 9. d187 14. d113 5. d8 10. d25 • (0.4) 15. d3 • (0.33) (1+0.66+0.5+0.4+0.33)/5=0.57 Favors systems which retrieve relevant documents quickly Hsin-Hsi Chen 3-12
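The following is a minimal Python sketch (function and variable names are my own) of the single-value summary above; it reproduces the slide's ranking, and the exact result differs slightly from 0.57 only because the slide rounds each precision to two digits before averaging.

```python
def average_precision(ranking, relevant):
    """Average the precision values observed at each relevant document in the ranking."""
    hits = 0
    precisions = []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# The slide's example: relevant documents appear at ranks 1, 3, 6, 10, 15.
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
relevant = {"d123", "d56", "d9", "d25", "d3"}
print(round(average_precision(ranking, relevant), 2))  # 0.58; the slide rounds each precision first and reports 0.57
```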
  • 13. Single Value Summaries (Continued) • Reciprocal Rank (RR) – Equal to the precision at the position of the first retrieved relevant document – Useful for tasks that need only one relevant document, e.g., question answering • Mean Reciprocal Rank (MRR) – The mean of RR over several queries Hsin-Hsi Chen 3-13
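A small Python sketch of RR and MRR as defined above (the function names and the convention of returning 0 when no relevant document is retrieved are my own assumptions):

```python
def reciprocal_rank(ranking, relevant):
    """RR = 1 / rank of the first relevant document (0 if none is retrieved)."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """MRR = mean of RR over several queries; `runs` is a list of (ranking, relevant_set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```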
  • 14. Single Value Summaries (Continued) • R-Precision – Generate a single value summary of the ranking by computing the precision at the R-th position in the ranking, where R is the total number of relevant documents for the current query – Example 1: ranking 1. d123 •, 2. d84, 3. d56 •, 4. d6, 5. d8, 6. d9 •, 7. d511, 8. d129, 9. d187, 10. d25 •; R=10 and # relevant in the top R = 4, so R-precision = 4/10 = 0.4 – Example 2: ranking 1. d123, 2. d84, 3. d56 •; R=3 and # relevant in the top R = 1, so R-precision = 1/3 = 0.33 Hsin-Hsi Chen 3-14
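A short Python sketch of R-precision; the document identifiers below come from the slide's first example, while the function name is illustrative:

```python
def r_precision(ranking, relevant):
    """Precision at rank R, where R is the total number of relevant documents for the query."""
    R = len(relevant)
    retrieved_relevant = sum(1 for doc in ranking[:R] if doc in relevant)
    return retrieved_relevant / R if R else 0.0

ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25"]
relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
print(r_precision(ranking, relevant))  # 0.4, as in the first example
```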
  • 15. Single Value Summaries (Continued) • Precision Histograms – An R-precision graph for several queries – Compare the retrieval history of two algorithms: RP_{A/B}(i) = RP_A(i) - RP_B(i), where RP_A(i) and RP_B(i) are the R-precision values of retrieval algorithms A and B for the i-th query – RP_{A/B}(i) = 0: both algorithms have equivalent performance for the i-th query – RP_{A/B}(i) > 0: A has better retrieval performance for query i – RP_{A/B}(i) < 0: B has better retrieval performance for query i Hsin-Hsi Chen 3-15
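A tiny helper, assuming per-query R-precision values for the two algorithms are already available (names are illustrative), that produces the values plotted in a precision histogram:

```python
def r_precision_differences(rp_a, rp_b):
    """RP_{A/B}(i) = RP_A(i) - RP_B(i) for each query i; positive values favour algorithm A."""
    return [a - b for a, b in zip(rp_a, rp_b)]

# e.g. r_precision_differences([0.4, 0.3, 0.5], [0.2, 0.4, 0.5]) -> [0.2, -0.1, 0.0]
```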
  • 17. Summary Table Statistics • Statistical summary regarding the set of all the queries in a retrieval task – the number of queries used in the task – the total number of documents retrieved by all queries – the total number of relevant documents which were effectively retrieved when all queries are considered – the total number of relevant documents which could have been retrieved by all queries – … Hsin-Hsi Chen 3-17
  • 18. Precision and Recall Appropriateness • Estimation of maximal recall requires knowledge of all the documents in the collection • Recall and precision capture different aspects of the set of retrieved documents • Recall and precision measure the effectiveness over queries in batch mode • Recall and precision are defined under the enforcement of linear ordering of the retrieved documents Hsin-Hsi Chen 3-18
  • 19. The Harmonic Mean • harmonic mean F(j) of recall and precision: F(j) = 2 / (1/R(j) + 1/P(j)) • R(j): the recall for the j-th document in the ranking • P(j): the precision for the j-th document in the ranking • equivalently, F = (2 × P × R) / (P + R) Hsin-Hsi Chen 3-19
  • 20. Example (for Rq={d3,d56,d129}) 1. d123 2. d84 3. d56 • (33.3%,33.3%) 4. d6 5. d8 6. d9 7. d511 8. d129 • (25%,66.6%) 9. d187 10. d25 11. d38 12. d48 13. d250 14. d113 15. d3 • (20%,100%) F(3) = 2 / (1/0.33 + 1/0.33) = 0.33, F(8) = 2 / (1/0.25 + 1/0.67) = 0.36, F(15) = 2 / (1/0.20 + 1/1) = 0.33 Hsin-Hsi Chen 3-20
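A minimal Python sketch of the harmonic mean F, reproducing the three values computed above (the function name and the zero guard are my own):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# The slide's example ranking for Rq = {d3, d56, d129}:
print(round(f_measure(1/3, 1/3), 2))    # F(3)  = 0.33
print(round(f_measure(0.25, 2/3), 2))   # F(8)  = 0.36
print(round(f_measure(0.20, 1.0), 2))   # F(15) = 0.33
```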
  • 21. The E Measure • E evaluation measure – Allows the user to specify whether he is more interested in recall or precision: E(j) = 1 - (1 + b^2) / (b^2/R(j) + 1/P(j)) – the corresponding F measure: F = ((beta^2 + 1) × P × R) / (beta^2 × P + R) Hsin-Hsi Chen 3-21
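A small sketch of the E measure in the same style (the guard that returns the worst value 1.0 when precision or recall is zero is my own assumption):

```python
def e_measure(precision, recall, b=1.0):
    """E(j) = 1 - (1 + b^2) / (b^2 / R(j) + 1 / P(j)); with b = 1 this reduces to 1 - F."""
    if precision == 0 or recall == 0:
        return 1.0
    return 1.0 - (1.0 + b * b) / (b * b / recall + 1.0 / precision)
```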
  • 22. User-oriented measures • Basic assumption of previous evaluation – The set of relevant documents for a query is the same, independent of the user • User-oriented measures – coverage ratio – novelty ratio – relative recall – recall effort Hsin-Hsi Chen 3-22
  • 23. coverage = |Rk| / |U|: a high coverage ratio means the system finds most of the relevant documents the user expected to see • novelty = |Ru| / (|Ru| + |Rk|): a high novelty ratio means the system reveals many new relevant documents which were previously unknown to the user • relative recall = (|Rk| + |Ru|) / |U| • recall effort: # of relevant docs the user expected to find / # of docs examined in order to find the expected relevant docs • Here |R| is the set of relevant docs, |A| the answer set proposed by the system, |U| the relevant docs known to the user, |Rk| the relevant docs known to the user which were retrieved, and |Ru| the relevant docs previously unknown to the user which were retrieved Hsin-Hsi Chen 3-23
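A minimal sketch of the user-oriented measures above, treating the answer set and the relevance information as Python sets (function and argument names are illustrative):

```python
def user_oriented_measures(answer_set, known_relevant, all_relevant):
    """coverage = |Rk|/|U|, novelty = |Ru|/(|Ru|+|Rk|), relative recall = (|Rk|+|Ru|)/|U|.
    known_relevant (U): relevant docs the user already knew about;
    all_relevant: every document judged relevant, known to the user or not."""
    rk = len(answer_set & known_relevant)                    # known relevant docs retrieved (Rk)
    ru = len((answer_set & all_relevant) - known_relevant)   # previously unknown relevant docs retrieved (Ru)
    u = len(known_relevant)
    coverage = rk / u if u else 0.0
    novelty = ru / (ru + rk) if (ru + rk) else 0.0
    relative_recall = (rk + ru) / u if u else 0.0
    return coverage, novelty, relative_recall
```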
  • 24. A More Modern Relevance Metric for Web Search • Normalized Discounted Cumulated Gain (NDCG) – K. Järvelin and J. Kekäläinen (TOIS 2002) – Gain: the relevance of a document is no longer binary – Sensitive to the position of the highest rated documents • Log-discounting of gains according to their positions – Normalize the DCG with the "ideal set" DCG Hsin-Hsi Chen 3-24
  • 25. NDCG Example • Assume that the relevance scores 0-3 are used. G'=<3, 2, 3, 0, 0, 1, 2, 2, 3, 0, …> • Cumulated Gain (CG): CG[i] = G[1] if i = 1, CG[i-1] + G[i] otherwise CG'=<3, 5, 8, 8, 8, 9, 11, 13, 16, 16, …> Hsin-Hsi Chen 3-25
  • 26. NDCG Example (Continued) • Discounted Cumulated Gain (DCG): DCG[i] = G[1] if i = 1, DCG[i-1] + G[i]/log_b(i) otherwise; let b=2, DCG'=<3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61, …> • Normalized Discounted Cumulated Gain (NDCG): Ideal vector I'=<3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, …>, CGI'=<3, 6, 9, 11, 13, 15, 16, 17, 18, 19, 19, 19, 19, …>, DCGI'=<3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 11.21, 11.53, 11.83, 11.83, …>  NDCG'=<1, 0.83, 0.89, 0.73, 0.62, 0.6, 0.69, 0.76, 0.89, 0.84, …> Hsin-Hsi Chen 3-26
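A short Python sketch of CG, DCG, and NDCG following the definitions above; it uses the slide's G' and ideal vectors, and the printed DCG values match the slide, while the NDCG values follow directly from DCG/DCG_ideal and can differ slightly from the slide's NDCG' line at some ranks:

```python
import math

def dcg(gains, b=2):
    """DCG[i] = G[1] for i = 1, DCG[i-1] + G[i]/log_b(i) otherwise (1-indexed, as on the slide)."""
    total, out = 0.0, []
    for i, g in enumerate(gains, start=1):
        total += g if i == 1 else g / math.log(i, b)
        out.append(total)
    return out

def ndcg(gains, ideal_gains, b=2):
    """NDCG[i] = DCG[i] / DCG_ideal[i]."""
    return [d / di for d, di in zip(dcg(gains, b), dcg(ideal_gains, b))]

g = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
ideal = [3, 3, 3, 2, 2, 2, 1, 1, 1, 1]
print([round(x, 2) for x in dcg(g)])        # [3.0, 5.0, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61]
print([round(x, 2) for x in ndcg(g, ideal)])
```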
  • 27. Test Collections • Components – document set (document collection) – queries (topics) – relevance judgments • Uses – design and development: system testing – evaluation: measuring system effectiveness – comparison: comparing different systems and different techniques • Benchmarking – different evaluation items depending on the purpose – quantitative measures, such as precision and recall Hsin-Hsi Chen 3-27
  • 28. Test Collections (Continued) • Small test collections – early: Cranfield – English: SMART collections, OHSUMED, Cystic Fibrosis, LISA, … – Japanese: BMIR-J2 • Large-scale evaluation environments: provide test collections and a forum for discussion – USA: TREC – Japan: NTCIR, IREX – Europe: AMARYLLIS, CLEF Hsin-Hsi Chen 3-28
  • 29. Basic data of the test collections (table): for each collection (Cranfield II, ADI, MEDLARS, TIME, CACM, CISI, NPL, INSPEC, ISILT, UKCIS, UKAEA, LISA, Cystic Fibrosis, OSHUMED, BMIR-J2, TREC, AMARYLLIS, NTCIR, IREX) the table lists the number of documents, collection size in MB, average words per document, number of topics, average words per topic, relevance-judgment levels, average number of relevant documents per topic, subject domain, and language. Early test collections consist of short bibliographic records (title, abstract, keywords) in specialized subject domains; recent test collections offer multi-topic full text with detailed topic statements at large scale (TREC-1~6: 1,754,896 documents, ~5 GB). Hsin-Hsi Chen 3-29
  • 30. Cranfield II (ftp://ftp.cs.cornell.edu/pub/smart/cran/) • Compared the retrieval effectiveness of 33 different indexing methods • Collected 1,400 documents (in abstract form) on aerodynamics; each author was asked to pose questions based on these documents and the subject he was researching at the time, and after screening more than 200 queries were produced. Example: .I 001 .W what similarity laws must be obeyed when constructing aeroelastic models of heated high speed aircraft? Hsin-Hsi Chen 3-30
  • 31. Cranfield II (Continued) • Relevance judgments in the Cranfield II collection were built in four steps (find possibly relevant documents, then verify): – First, the authors of the queries judged the relevance of the citations and references attached to the documents. – Next, five graduate students in the field checked every query against every document, spending about 1,500 hours and making more than 500,000 relevance judgments, in order to find as many relevant documents as possible. – To catch documents missed by the previous steps, bibliographic coupling was used to compute similarity between documents and discover further candidate relevant documents (two or more documents that cite one or more papers in common are said to be coupled). – Finally, all the documents found above were sent back to the original query authors for a final judgment. Hsin-Hsi Chen 3-31
  • 32. TREC: Overview • TREC: Text REtrieval Conference • Organizers: NIST and DARPA; one of the subprojects of the TIPSTER text program • Leader: Donna Harman (Manager of The Natural Language Processing and Information Retrieval Group of the Information Access and User Interfaces Division, NIST) • Document collection – more than 5 GB – several million documents Hsin-Hsi Chen 3-32
  • 33. History • TREC-1 (Text Retrieval Conference) Nov 1992 • TREC-2 Aug 1993 • TREC-3 • TREC-7: January 16, 1998 -- submit application to NIST. Beginning February 2 -- document disks distributed to those new participants who have returned the required forms. June 1 -- 50 new test topics for the ad hoc task distributed. August 3 -- ad hoc results due at NIST. September 1 -- latest track submission deadline. September 4 -- speaker proposals due at NIST. October 1 -- relevance judgments and individual evaluation scores due back to participants. Nov. 9-11 -- TREC-7 conference at NIST in Gaithersburg, Md. • TREC-8 (1999) • TREC-9 (2000) • TREC-10 (2001) … Hsin-Hsi Chen 3-33
  • 34. The Test Collection • the documents • the example information requests (called topics in TREC) • the relevance judgments (right answers) Hsin-Hsi Chen 3-34
  • 35. The Documents • Disk 1 (1GB) – WSJ: Wall Street Journal (1987, 1988, 1989) – AP: AP Newswire (1989) – ZIFF: Articles from Computer Select disks (Ziff-Davis Publishing) – FR: Federal Register (1989) – DOE: Short abstracts from DOE publications • Disk 2 (1GB) – WSJ: Wall Street Journal (1990, 1991, 1992) – AP: AP Newswire (1988) – ZIFF: Articles from Computer Select disks – FR: Federal Register (1988) Hsin-Hsi Chen 3-35
  • 36. The Documents (Continued) • Disk 3 (1 GB) – SJMN: San Jose Mercury News (1991) – AP: AP Newswire (1990) – ZIFF: Articles from Computer Select disks – PAT: U.S. Patents (1993) • Statistics – document lengths: DOE (very short documents) vs. FR (very long documents) – range of document lengths: AP (similar in length) vs. WSJ and ZIFF (wider range of lengths) Hsin-Hsi Chen 3-36
  • 37. TREC document collection (table): for each of the five document volumes the table lists the sources, size in MB, number of documents, and median/mean number of terms per document. Volume 1: Wall Street Journal 1987-1989, Associated Press newswire 1989, Computer Selects articles (Ziff-Davis), Federal Register 1989, abstracts of U.S. DOE publications (revised March 1994). Volume 2: Wall Street Journal 1990-1992 (WSJ), Associated Press newswire 1988 (AP), Computer Selects articles (ZIFF), Federal Register 1988 (FR88) (revised March 1994). Volume 3: San Jose Mercury News 1991, Associated Press newswire 1990, Computer Selects articles, U.S. patents 1993 (revised March 1994). Volume 4: The Financial Times 1991-1994 (FT), Federal Register 1994 (FR94), Congressional Record 1993 (CR) (revised May 1996). Volume 5: Foreign Broadcast Information Service (FBIS), Los Angeles Times 1989-1990 (revised April 1997). The routing test data also came from FBIS. As noted, DOE documents are very short and FR documents very long; AP documents are similar in length, while WSJ and ZIFF have a wider range of lengths. Hsin-Hsi Chen 3-37
  • 38. Document Format (in Standard Generalized Mark-up Language, SGML) <DOC> <DOCNO>WSJ880406-0090</DOCNO> <HL>AT&T Unveils Services to Upgrade Phone Networks Under Global Plan </HL> <AUTHOR>Janet Guyon (WSJ staff) </AUTHOR> <DATELINE>New York</DATELINE> <TEXT> American Telephone & Telegraph Co. introduced the first of a new generation of phone services with broad implications for computer and communications . . </TEXT> </DOC> Hsin-Hsi Chen 3-38
  • 39. TREC document markup (example) <DOC> <DOCN0>FT911-3</DOCN0> <PROFILE>AN-BE0A7AAIFT</PROFILE> <DATE>910514 </DATE> <HEADLINE> FT 14 MAY 91 / International Company News: Contigas plans DM900m east German project </HEADLINE> <BYLINE> By DAVID GOODHART </BYLINE> <DATELINE> BONN </DATELINE> <TEXT> CONTIGAS, the German gas group 81 per cent owned by the utility Bayernwerk, said yesterday that it intends to invest DM900m (Dollars 522m) in the next four years to build a new gas distribution system in the east German state of Thuringia. … </TEXT> </DOC> Hsin-Hsi Chen 3-39
  • 40. The Topics • Issue 1 – allow a wide range of query construction methods – keep the topic (user need) distinct from the query (the actual text submitted to the system) • Issue 2 – increase the amount of information available about each topic – include with each topic a clear statement of what criteria make a document relevant • TREC – 50 topics/year, 400 topics (TREC1~TREC7) Hsin-Hsi Chen 3-40
  • 41. Sample Topics used in TREC-1 and TREC-2 <top> <head>Tipster Topic Description <num>Number: 066 <dom>Domain: Science and Technology <title>Topic: Natural Language Processing <desc>Description: (one-sentence description) Document will identify a type of natural language processing technology which is being developed or marketed in the U.S. <narr>Narrative: (complete description of document relevance for assessors) A relevant document will identify a company or institution developing or marketing a natural language processing technology, identify the technology, and identify one or more features of the company's product. <con>Concepts: (a mini-knowledge base about the topic, such as a real searcher might possess) 1. natural language processing 2. translation, language, dictionary, font 3. software applications Hsin-Hsi Chen 3-41
  • 42. <fac>Factors: (allow easier automatic query building by listing specific items from the narrative that constrain the documents that are relevant) <nat>Nationality: U.S. </fac> <def>Definition(s): </top> Hsin-Hsi Chen 3-42
  • 43. TREC-1 and TREC-2 topic example <top> <head> Tipster Topic Description <num> Number: 037 <dom> Domain: Science and Technology <title> Topic: Identify SAA components <desc> Description: Document identifies software products which adhere to IBM's SAA standards. <narr> Narrative: To be relevant, a document must identify a piece of software which is considered a Systems Application Architecture (SAA) component or one which conforms to SAA. <con> Concept(s): 1. SAA 2. OfficeVision 3. IBM 4. Standards, Interfaces, Compatibility <fac> Factor(s): <def> Definition(s): OfficeVision - A series of integrated office automation applications from IBM that runs across all of its major computer families. Systems Application Architecture (SAA) - A set of IBM standards that provide consistent user interfaces, programming interfaces, and communications protocols among all IBM computers from micro to mainframe. </top> Hsin-Hsi Chen 3-43
  • 44. TREC-3 topic example <top> <num> Number: 177 <title> Topic: English as the Official Language in U.S. <desc> Description: Document will provide arguments supporting the making of English the standard language of the U.S. <narr> Narrative: A relevant document will note instances in which English is favored as a standard language. Examples are the positive results achieved by immigrants in the areas of acceptance, greater economic opportunity, and increased academic achievement. Reports are also desired which describe some of the language difficulties encountered by other nations and groups of nations, e.g., Canada, Belgium, European Community, when they have opted for the use of two or more languages as their official means of communication. Not relevant are reports which promote bilingualism or multilingualism. </top> Hsin-Hsi Chen 3-44
  • 45. Sample Topics used in TREC-3 <num>Number: 168 <title>Topic: Financing AMTRAK <desc>Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK) <narr>Narrative: A relevant document must provide information on the government's responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant. Hsin-Hsi Chen 3-45
  • 46. Features of topics in TREC-3 • The topics are shorter. • The topics miss the complex structure of the earlier topics. • The concept field has been removed. • The topics were written by the same group of users that did assessments. • Summary: – TREC-1 and 2 (1-150): suited to the routing task – TREC-3 (151-200): suited to the ad-hoc task Hsin-Hsi Chen 3-46
  • 47. TREC-4 topic example <top> <num> Number: 217 <desc> Description: Reporting on possibility of and search for extra-terrestrial life/intelligence. </top> TREC-4 kept only the description field; TREC-5 adjusted the topics back to a structure similar to TREC-3, but with a shorter average length. Hsin-Hsi Chen 3-47
  • 48. TREC topics: structure and length (table): for TREC-1 through TREC-6 the table gives the minimum, maximum, and average number of words (including stopwords) in each topic field (Title, Description, Narrative, Concepts) and in the topic as a whole, showing that topic statements became much shorter over time (TREC-4 topics consist only of a short description). • Topic construction • Topic screening – pre-search – checking the number of relevant documents Hsin-Hsi Chen 3-48
  • 49. TREC-6 topic screening procedure (flowchart): enter keywords into the PRISE system and run a search, then count how many of the top 25 retrieved documents are relevant. Topics with 0 relevant documents or with 20 or more are rejected. For the intermediate cases (1-5 or 6-20 relevant), the assessor either continues reading documents 26-100 and judges their relevance, or uses relevance feedback and additional query statements to search again and judge the relevance of the top 100 documents. Finally, the number of relevant documents is recorded. Hsin-Hsi Chen 3-49
  • 50. The Relevance Judgments • For each topic, compile a list of relevant documents. • Approaches – full relevance judgments (impossible): judge over 1M documents for each topic, resulting in 100M judgments – random sample of documents (insufficient relevance sample): relevance judgments done on the random sample only – TREC approach (pooling method): make relevance judgments on the sample of documents selected by the various participating systems; assumption: the vast majority of relevant documents have been found, and documents that have not been judged can be assumed to be not relevant • Pooling method – Take the top 100 documents retrieved by each system for a given topic. – Merge them into a pool for relevance assessment. – The sample is given to human assessors for relevance judgments. Hsin-Hsi Chen 3-50
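A minimal sketch of the pooling step (the function name and the pool-depth parameter are illustrative; the default depth of 100 matches the TREC practice described above):

```python
def build_pool(runs, depth=100):
    """Pooling method: take the top `depth` documents from each submitted run for one topic,
    merge them, and drop duplicates; the resulting pool is then judged by human assessors."""
    pool = set()
    for ranked_list in runs:      # one ranked document list per participating system
        pool.update(ranked_list[:depth])
    return pool
```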
  • 51. TREC: relevance judgments • Judgment method – pooling method – manual judgment • Judgment criterion: binary, relevant vs. not relevant • Quality of the relevance judgments – completeness – consistency Hsin-Hsi Chen 3-51
  • 52. The pooling method • For each topic, the top n documents are taken from the results submitted by each participating system and merged into a pool. • The pool is treated as the candidate set of possibly relevant documents for that topic; after duplicates are removed, it is sent back to the original topic author for relevance judgment. • The idea is to use many different systems and retrieval techniques to capture as many of the possibly relevant documents as possible, thereby reducing the burden of manual judgment. Hsin-Hsi Chen 3-52
  • 53. Overlap of Submitted Results • TREC-1 (TREC-2): top 100 documents for each run (33 runs and 40 runs); TREC-3: top 100 (200) documents for each run (48 runs). After pooling, each topic was judged by a single assessor to ensure the best consistency of judgment. • Although TREC-1 and TREC-2 differ by 7 runs, the fraction of unique documents retrieved (39% vs. 28%) does not differ greatly, nor does the number of documents judged relevant (22% vs. 19%). TREC-3 doubled the number of documents taken per run for judging, yet the unique fraction is again similar (21% vs. 20%), as is the fraction judged relevant (15% vs. 10%). Hsin-Hsi Chen 3-53
  • 54. Comparison of the TREC candidate pools and the actual relevant documents (table): for the ad hoc and routing tasks of TREC-1 through TREC-6, the table lists the total number of documents submitted to the pool by all systems, the number of documents actually in the pool after removing duplicates (with its percentage), and the number of documents judged relevant (with its percentage); for example, one TREC-1 pool received 8,800 submitted documents, of which 1,279 were unique and 277 were judged relevant. Hsin-Hsi Chen 3-54
  • 55. Format of the TREC relevance-judgment records: each line gives the topic number, a second field (always 0 here), the document identifier, and the judgment level, e.g. 54 0 FB6-F004-0059 0, 54 0 FB6-F004-0073 1, 54 0 FB6-F004-0077 1, 54 0 FB6-F004-0078 1, 54 0 FB6-F004-0080 1, …, 54 0 FB6-F004-0105 1 Hsin-Hsi Chen 3-55
  • 56. TREC evaluation: tasks and tracks (table, TREC-1 through TREC-7): the main tasks are routing and ad hoc; the tracks include confusion, spoken document retrieval, database merging, filtering, high precision, interactive, cross-language, multilingual (Spanish, Chinese), natural language processing, query, and very large corpus; the table marks in which TRECs each task or track was run. Hsin-Hsi Chen 3-56
  • 57. TREC-7 • Ad hoc task – Participants will receive 5 gigabytes of data for use in training their systems. – The 350 topics used in the first six TREC workshops and the relevance judgments for those topics will also be available. – The 50 new test topics (351-400) will be distributed in June and will be used to search the document collection consisting of the documents on TREC disks 4 and 5. – Results will be submitted to NIST as the ranked top 1000 documents retrieved for each topic. Hsin-Hsi Chen 3-57
  • 58. TREC-7 (Continued) • Track tasks – Filtering Track • A task in which the topics are stable (and some relevant documents are known) but there is a stream of new documents. • For each document, the system must make a binary decision as to whether the document should be retrieved (as opposed to forming a ranked list). – Cross-Language Track • An ad hoc task in which some documents are in English, some in German, and others in French. • The focus of the track will be to retrieve documents that pertain to the topic regardless of language. Hsin-Hsi Chen 3-58
  • 59. TREC-7 (Continued) • High Precision User Track – An ad hoc task in which participants are given five minutes per topic to produce a retrieved set using any means desired (e.g., through user interaction, completely automatically). • Interactive Track – A task used to study user interaction with text retrieval systems. • Query Track – A track designed to foster research on the effects of query variability and analysis on retrieval performance. – Participants each construct several different versions of existing TREC topics, some versions as natural language topics and some as structured queries in a common format. – All groups then run all versions of the topics. Hsin-Hsi Chen 3-59
  • 60. TREC-7 (Continued) • Spoken Document Retrieval Track – An ad hoc task that investigates a retrieval system's ability to retrieve spoken documents (recordings of speech). • Very Large Corpus (VLC) – An ad hoc task that investigates the ability of retrieval systems to handle larger amounts of data. The current target corpus size is approximately 100 gigabytes. Hsin-Hsi Chen 3-60
  • 61. Categories of Query Construction • AUTOMATIC completely automatic initial query construction • MANUAL manual initial construction • INTERACTIVE use of interactive techniques to construct the queries Hsin-Hsi Chen 3-61
  • 62. Levels of Participation • Category A: full participation • Category B: full participation using a reduced database • Category C: evaluation only • submit up to two runs for routing task, the adhoc task, or both • send in the top 1000 documents retrieved for each topic for evaluation Hsin-Hsi Chen 3-62
  • 63. TREC-3 Participants (14 companies, 19 universities) Hsin-Hsi Chen 3-63
  • 64. TREC-6 Apple Computer AT&T Labs Research Australian National Univ. Carnegie Mellon Univ. CEA (France) Center for Inf. Res., Russia Duke Univ./Univ. of Colorado/Bellcore ETH (Switzerland) FS Consulting, Inc. GE Corp./Rutgers Univ. George Mason Univ./NCR Corp Harris Corp. IBM T.J. Watson Res. (2 groups) ISS (Singapore) ITI (Singapore) APL, Johns Hopkins Univ. …………… Hsin-Hsi Chen 3-64
  • 65. Evaluation Measures at TREC • Summary table statistics – The number of topics used in the task – The number of documents retrieved over all topics – The number of relevant documents which were effectively retrieved for all topics • Recall-precision averages • Document level averages – Average precision at specified document cutoff values (e.g., 5, 10, 20, 100 relevant documents) • Average precision histogram Hsin-Hsi Chen 3-65
  • 66. TREC: criticisms and negative assessments • Test collection – Topics • not real user needs; too artificial • lack a description of the need's context – Relevance judgments • binary relevance judgments are unrealistic • the pooling method misses relevant documents, so recall estimates are inaccurate • quality and consistency • Effectiveness measurement – focuses only on quantitative measures – problems with recall – suitable for comparing systems, but not for absolute evaluation Hsin-Hsi Chen 3-66
  • 67. TREC: criticisms and negative assessments (continued) • Evaluation procedure – Interactive retrieval • lack of user involvement • static information needs are unrealistic Hsin-Hsi Chen 3-67
  • 68. NTCIR: overview • NTCIR: NACSIS Test Collections for IR • Organizer: NACSIS (Japan's National Center for Science Information Systems) • Background – the need for a large Japanese benchmark test collection – the needs of research and development in cross-language retrieval • Document collection – source: the NACSIS Academic Conference Papers Database – mainly abstracts of conference papers – more than 330,000 documents, of which more than half are English-Japanese paired documents – some include part-of-speech tags Hsin-Hsi Chen 3-68
  • 69. NTCIR: topics • Source: real user needs were collected and then revised • 100 topics, from different subject fields • Structure: <TOPIC q=nnnn> number <title> title </title> <description> short description of the information need </description> <narrative> detailed description of the information need, including further explanation, definitions of terms, background knowledge, purpose of the search, expected number of relevant documents, desired document types, criteria for relevance judgment, etc. </narrative> <concepts> keywords for related concepts </concepts> Hsin-Hsi Chen 3-69
  • 70. NTCIR: relevance judgments • Judgment method – candidates are first screened with the pooling method – judgments are made by subject experts and by the topic authors • Judgment levels – A: relevant – B: partially relevant – C: not relevant • Precision computation: differs across test items – Relevant quel: both B and C are treated as not relevant – Partial Relevant quel: both A and B are treated as relevant Hsin-Hsi Chen 3-70
  • 71. NTCIR: evaluation • Ad-hoc information retrieval task • Cross-lingual information retrieval task – Japanese topics are used to retrieve English documents – 21 topics, with relevance judgments covering both English and Japanese documents – systems may build queries automatically or manually – systems must return the top 1,000 retrieved documents • Automatic term extraction and role analysis task – automatic term extraction: extract technical terms from titles and abstracts – role analysis task Hsin-Hsi Chen 3-71
  • 72. NTCIR Workshop 2 • organizers – Hsin-Hsi Chen (Chinese IR track) – Noriko Kando (Japanese IR track) – Sung-Hyon Myaeng (Korean IR track) • Chinese test collection – developer: Professor Kuang-hua Chen (LIS, NTU) – Document collection: 132,173 news stories – Topics: 50 Hsin-Hsi Chen 3-72
  • 73. NTCIR 2 schedule • Someday in April, 2000: Call for Participation • May or later: Training set will be distributed • August, 2000: Test Documents and Topics will be distributed. • Sept.10-30, 2000: Results submission • Jan., 2001: Evaluation results will be distributed. • Feb. 1, 2001: Paper submission for working notes • Feb. 19-22, 2001 (or Feb. 26-March 1): Workshop (in Tokyo) • March, 2001: Proceedings Hsin-Hsi Chen 3-73
  • 74. IREX: overview • IREX: Information Retrieval and Extraction Exercise • Organizer: Information Processing Society of Japan • Participants: about 20 teams (or more) • Preliminary test: uses the topics of the BMIR-J2 test collection • Document collection – Mainichi Shimbun newspaper articles, 1994-1995 – participants must purchase the news corpus Hsin-Hsi Chen 3-74
  • 75. IREX: topics • Structure: <topic_id> number </topic_id> <description> short statement of the information need, mainly a noun phrase made of nouns and their modifiers </description> <narrative> detailed information need stated in natural language, usually 2 to 3 sentences, possibly including definitions of terms, synonyms, or examples </narrative> – the terms in the description field must also be contained in the narrative field Hsin-Hsi Chen 3-75
  • 76. IREX: relevance judgments • Judgment basis: all fields of the test topic • Judgment method: two students judge each topic – if their judgments agree, the relevance judgment is final – if they disagree or are uncertain, a third judge makes the final decision • Judgment levels – students: 6 levels • A: relevant • B: partially relevant • C: not relevant • A?: uncertain whether relevant • B?: uncertain whether partially relevant • C?: uncertain whether not relevant Hsin-Hsi Chen 3-76
  • 77. IREX: relevance judgments (continued) – final judge: 3 levels • A: relevant • B: partially relevant • C: not relevant • Revision of the relevance judgments Hsin-Hsi Chen 3-77
  • 78. IREX: evaluation • Tasks – Named Entity task (NE) • similar to MUC; tests a system's ability to automatically extract proper names such as organization, person, and place names • general-domain vs. special-domain extraction – Information Retrieval (IR) • similar to TREC • Rules – systems return the top 300 documents Hsin-Hsi Chen 3-78
  • 79. BMIR-J2: overview • The first Japanese test collection for IR systems – BMIR-J1: 1996 – BMIR-J2: March 1998 • Developed by: IPSG-SIGDS • Document collection: mainly news articles – Mainichi Shimbun: 5,080 articles – economics and engineering • Topics: 60 Hsin-Hsi Chen 3-79
  • 80. BMIR-J2: relevance judgments • 1-2 IR systems are searched with Boolean combinations of keywords • a database searcher then makes further relevance judgments • the builders of the test collection check the judgments once more Hsin-Hsi Chen 3-80
  • 81. BMIR-J2: topics. Q: F=oxoxo: "Utilizing solar energy" Q: N-1: Retrieve texts mentioning use of solar energy. Q: N-2: Include texts concerning generating electricity and drying things with solar heat. • Topic classification – purpose: mark the characteristics of each test topic so that systems can be selected accordingly – labels: o (necessary), x (unnecessary) – categories • the basic function • the numeric range function • the syntactic function • the semantic function • the world knowledge function Hsin-Hsi Chen 3-81
  • 82. AMARYLLIS: overview • Organizer: INIST (Institut de l'Information Scientifique et Technique, France) • Participants: nearly 10 teams • Document collection – news articles: The World, more than 20,000 articles – titles and abstracts of documents extracted from the Pascal (1984-1995) and Francis (1992-1995) databases, more than 300,000 in total Hsin-Hsi Chen 3-82
  • 83. AMARYLLIS: topics • Structure: <num> number </num> <dom> subject domain </dom> <suj> title </suj> <que> short description of the information need </que> <cinf> detailed description of the information need </cinf> <ccept><c> concepts, descriptors </c></ccept> Hsin-Hsi Chen 3-83
  • 84. AMARYLLIS: relevance judgments • Original relevance judgments – built by the owners of the document collections • Revision of the answer key – additions: • documents not in the initial answer key but retrieved by more than half of the participants • the top 10 documents in each participant's submitted results – deletions: • documents that appear in the original answer key but not in any of the participants' submitted results Hsin-Hsi Chen 3-84
  • 85. AMARYLLIS: evaluation • systems must return the top 250 retrieved documents • systems may build the query automatically or manually • tasks – routing task – ad hoc task Hsin-Hsi Chen 3-85
  • 86. An Evaluation of Query Processing Strategies Using the Tipster Collection (SIGIR 1993: 347-355) James P. Callan and W. Bruce Croft Hsin-Hsi Chen 3-86
  • 87. INQUERY Information Retrieval System • Documents are indexed by the word stems and numbers that occur in the text. • Documents are also indexed automatically by a small number of features that provide a controlled indexing vocabulary. • When a document refers to a company by name, the document is indexed by the company name and the feature #company. • INQUERY includes company, country, U.S. city, number and date, and person name recognizer. Hsin-Hsi Chen 3-87
  • 88. INQUERY Information Retrieval System • feature operators #company operator matches the #company feature • proximity operators require their arguments to occur either in order, within some distance of each other, or within some window • belief operators use the maximum, sum, or weighted sum of a set of beliefs • synonym operators • Boolean operators Hsin-Hsi Chen 3-88
  • 89. Query Transformation in INQUERY • • • • Discard stop phrases. Recognize phrases by stochastic part of speech tagger. Look for word “not” in the query. Recognize proper names by assuming that a sequence of capitalized words is a proper name. • Introduce synonyms by a small set of words that occur in the Factors field of TIPSTER topics. • Introduce controlled vocabulary terms (feature operators). Hsin-Hsi Chen 3-89
  • 90. Techniques for Creating Ad Hoc Queries • Simple Queries (description-only approach) – Use the contents of Description field of TIPSTER topics only. – Explore how the system behaves with the very short queries. • Multiple Sources of Information (multiple-field approach) – Use the contents of the Description, Title, Narrative, Concept(s) and Factor(s) fields. – Explore how a system might behave with an elaborate user interface or very sophisticated query processing • Interactive Query Creation – Automatic query creation followed by simple manual modifications. – Simulate simple user interaction with the query processing. Hsin-Hsi Chen 3-90
  • 91. Simple Queries • A query is constructed automatically by employing all the query processing transformations on Description field. • The remaining words and operators are enclosed in a weighted sum operator. • 11-point average precision Hsin-Hsi Chen 3-91
  • 93. Multiple Sources of Information • Q-1 (baseline model): created automatically, using the T, D, N, C and F fields; everything except the synonym and concept operators was discarded from the Narrative field. • Q-3 (words-only query): the same as Q-1, except that recognition of phrases and proper names was disabled; used to determine whether phrase and proximity operators were helpful. • Q-4: the same as Q-1, except that recognition of phrases was also applied to the Narrative field; used to determine whether the simple query processing transformations would be effective on the abstract descriptions in the Narrative field. Hsin-Hsi Chen 3-93
  • 94. Multiple Sources of Information (Continued) • Q-6: the same as Q-1, except that only the T, C, and F fields were used; narrows in on the set of fields that appeared most useful. • Q-F: the same as Q-1, with 5 additional thesaurus words or phrases added automatically to each query; an approach to automatically discovering thesaurus terms. • Q-7: a combination of Q-1 and Q-6; tests whether combining the results of two relatively similar queries could yield an improvement. At first glance Q-6 looks like a subset of Q-1, so there would seem to be no need to combine them, but on closer inspection they differ: when terms are selected according to some criterion, Q-1 may keep only a small part of the T, C, and F fields, whereas Q-6 does not. Hsin-Hsi Chen 3-94
  • 95. A Comparison of Six Automatic Methods of Constructing Ad Hoc Queries • Discarding the Description and Narrative fields did not hurt performance appreciably. • Q-1 and Q-6, which are similar, retrieve different sets of documents. • Phrases from the Narrative were not helpful. • Phrases improved performance at low recall. • It is possible to automatically construct a useful thesaurus for a collection. Hsin-Hsi Chen 3-95
  • 96. Interactive Query Creation • The system created a query using method Q-1, and then a person was permitted to modify the resulting query. • Modifications – add words from the Narrative field – delete words or phrases from the query – indicate that certain words or phrases should occur near each other within a document • Q-M (+addition, +deletion): manual addition of words or phrases from the Narrative, and manual deletion of words or phrases from the query • Q-O (+addition, +deletion, +proximity): the same as Q-M, except that the user could also indicate that certain words or phrases must occur within 50 words of each other Hsin-Hsi Chen 3-96
  • 97. Paragraph retrieval (requiring query words to occur within 50 words of each other) significantly improves effectiveness. The improvement appears at recall levels of 10%-60%, which is acceptable because users are not likely to examine all of the documents retrieved. Hsin-Hsi Chen 3-97
  • 98. The effects of thesaurus terms and phrases on queries that were created automatically and then modified manually. Q-MF: thesaurus expansion before modification. Q-OF: thesaurus expansion after modification, with inclusion of unordered window operators; thesaurus words and phrases were added after the query was modified, so they were not used in unordered window operators (cf. Q-O: 42.7). Hsin-Hsi Chen 3-98
  • 99. Okapi at TREC3 and TREC4 SE Robertson, S Walker, S Jones, MM Hancock-Beaulieu, M Gatford Department of Information Science City University Hsin-Hsi Chen 3-99
  • 100. The probabilistic model underlying Okapi: sim(d_j, q) ≈ P(d_j|R) / P(d_j|¬R) ≈ Σ_i g_i(d_j) g_i(q) × log [ P(k_i|R)(1 − P(k_i|¬R)) / ( P(k_i|¬R)(1 − P(k_i|R)) ) ], with the estimates P(k_i|R) = (V_i + 0.5)/(V + 1), 1 − P(k_i|R) = (V − V_i + 0.5)/(V + 1), P(k_i|¬R) = (n_i − V_i + 0.5)/(N − V + 1), 1 − P(k_i|¬R) = (N − V − n_i + V_i + 0.5)/(N − V + 1), so that each matching term contributes log [ (V_i + 0.5)(N − V − n_i + V_i + 0.5) / ( (n_i − V_i + 0.5)(V − V_i + 0.5) ) ] Hsin-Hsi Chen 3-100
  • 101. BM25 function in Okapi: score(Q, d) = Σ_{T ∈ Q} w^(1) × (k1 + 1)·tf / (K + tf) × (k3 + 1)·qtf / (k3 + qtf) + k2·|Q|·(avdl − dl)/(avdl + dl) • Q: a query, containing terms T • w^(1): the Robertson-Sparck Jones weight, w^(1) = log [ (r + 0.5)(N − n − R + r + 0.5) / ( (n − r + 0.5)(R − r + 0.5) ) ] • N: the number of documents in the collection • n: the number of documents containing the term (cf. n_i above) • R: the number of documents known to be relevant to a specific topic (cf. V) • r: the number of relevant documents containing the term (cf. V_i) • K = k1·((1 − b) + b·dl/avdl) • k1, b, k2 and k3: parameters which depend on the database and the nature of the topics; in the TREC-4 experiments k1, k3 and b were 1.0-2.0, 8 and 0.6-0.75 respectively, and k2 was zero throughout • tf: frequency of occurrence of the term within a specific document • qtf: frequency of the term within the topic from which Q was derived • dl: document length • avdl: average document length Hsin-Hsi Chen 3-101
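A minimal Python sketch of the BM25 scoring function above, under the simplifying assumptions stated in the comments (no relevance information and k2 = 0); the parameter defaults are illustrative rather than the exact TREC settings:

```python
import math

def bm25(query_tf, doc_tf, dl, avdl, N, df, k1=1.2, b=0.75, k3=8.0):
    """Score one document against one query with the Okapi BM25 function.
    Assumes no relevance information (R = r = 0), so w(1) reduces to
    log((N - n + 0.5) / (n + 0.5)), and k2 = 0, which drops the query-length term.
    query_tf: term -> frequency in the topic; doc_tf: term -> frequency in the document;
    df: term -> number of documents containing the term."""
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        n = df.get(term, 0)
        if tf == 0 or n == 0:
            continue
        w1 = math.log((N - n + 0.5) / (n + 0.5))          # Robertson-Sparck Jones weight with R = r = 0
        score += w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score
```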
