Bringing Undergraduate Students
Closer to a Real-World
Information Retrieval Setting:
Methodology and Resources

Julián Urbano, Mónica Marrero,
Diego Martín and Jorge Morato
http://julian-urbano.info
Twitter: @julian_urbano


ACM ITiCSE 2011
Darmstadt, Germany · June 29th



Information Retrieval
• Automatically search documents
  ▫ Effectively
  ▫ Efficiently

• Expected between 2009 and 2020 [Gantz et al., 2010]
  ▫ 44 times more digital information
  ▫ 1.4 times more staff and investment

• IR is very important for the world
• IR is clearly important for Computer Science students
  ▫ Usually not in the core program of CS majors
    [Fernández-Luna et al., 2009]



Experiments in Computer Science
• Decisions in the CS industry
  ▫ Based on mere experience
  ▫ Bias towards or against some technologies

• CS professionals need background
  ▫ Experimental methods
  ▫ Critical analysis of experimental studies [IEEE/ACM, 2001]

• IR is a highly experimental discipline [Voorhees, 2002]
• Focus on evaluation of search engine effectiveness
  ▫ Evaluation experiments not common in IR courses
    [Fernández-Luna et al., 2009]


Proposal:
Teach IR through Experimentation
• Try to resemble a real-world IR scenario
 ▫ Relatively large amounts of information
 ▫ Specific characteristics and needs
 ▫ Evaluate which techniques work better

• Can we reliably adapt the methodologies of
  academic/industrial evaluation workshops?
 ▫ TREC, INEX, CLEF…



Current IR Courses
• Develop IR systems extending frameworks/tools
  ▫ IR Base [Calado et al., 2007]
  ▫ Alkaline, Greenstone, SpidersRUs [Chau et al., 2010]
  ▫ Lucene, Lemur

• Students should focus on the concepts
  and not struggle with the tools [Madnani et al., 2008]



Current IR Courses (II)
• Develop IR systems from scratch
 ▫ Build your search engine in 90 days [Chau et al., 2003]
 ▫ History Places [Hendry, 2007]

• Only possible in technical majors
 ▫   Software development
 ▫   Algorithms
 ▫   Data structures
 ▫   Database management

• Much more complicated than using existing tools
• But much better for understanding the IR techniques
  and processes explained in class



Current IR Courses (III)
• Students must learn the Scientific Method
 [IEEE/ACM Computing Curricula, 2001]



• Not only get the numbers [Markey et al., 1978]
  ▫ Where do they come from?
  ▫ What do they really mean?
  ▫ Why are they calculated that way?
• Analyze the reliability and validity

• Not usually paid much attention in IR courses
 [Fernández-Luna et al., 2009]



Current IR Courses (and IV)
• Modern CS industry is changing
  ▫ Big data

• Real conditions differ widely from class [Braught et al., 2004]
  ▫ Interesting from the experimental viewpoint

• Finding free collections to teach IR is hard
  ▫ Copyright
  ▫ Diversity

• Usually stick to the same collections year after year
  ▫ Try to make them a product of the course itself



Our IR Course
• Senior CS undergraduates
  ▫ So far: elective, 1 group of ~30 students
  ▫ From next year: mandatory, 2-3 groups of ~35 students

• Exercise: very simple Web crawler [Urbano et al., 2010]
• Search engine from scratch, in groups of 3-4 students
  ▫ Module 1: basic indexing and retrieval
  ▫ Module 2: query expansion
  ▫ Module 3: Named Entity Recognition
• Skeleton and evaluation framework are provided
• External libraries allowed for certain parts
  ▫ Stemmer, WordNet, etc.

• Microsoft .NET, C#, SQL Server Express

Evaluation in Early TREC Editions
• Document collection, depending on the task, domain…
• Relevance assessors (retired analysts) propose candidate topics
  ▫ Topic difficulty is considered
  ▫ Organizers choose the final ~50 topics
• Participants run the topics and submit their top 1,000 results per topic
• Organizers take the top 100 results of each run to form a depth-100 pool per topic
  ▫ Depending on the overlap between runs, pool sizes vary (usually ~1/3)
• Assessors judge what documents in each pool are relevant
  ▫ Organizers collect the resulting relevance judgments



Changes for the Course
• Document collection cannot be too large
• Topics are restricted
  ▫ Different themes = many diverse documents
  ▫ Similar topics = fewer documents
  ▫ Start with a theme common to all topics
• Write topics with a common theme
• How do we get relevant documents? Try to answer each topic ourselves
  ▫ Too difficult? Too few documents? Those topics are left out
• The final topics are fed to a crawler to build the complete document collection



Changes for the Course
• Relevance judging would be next

• Judgments must be collected before students start developing their systems
  ▫ Otherwise cheating is very tempting: all of (my) results are relevant!

• Reliable pools without the student systems?

• Use other well-known systems: Lucene & Lemur
 ▫ These are the pooling systems
 ▫ Different techniques, similar to what we teach



Changes for the Course
• The Lucene and Lemur runs are pooled for each topic
• Depth-k pools would lead to very different pool sizes across topics
  ▫ Some students would judge many more documents than others
• Instead of depth-k, pool until size-k (see the sketch below)
  ▫ Size-100 in our case
  ▫ About 2 hours to judge, possible to do in class
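
As an illustration of size-k pooling, a minimal sketch (not the actual course tool; the run format is an assumption): take documents round-robin from the pooling systems' rankings, one rank deeper at a time, until the pool reaches the target size.

```python
# Minimal sketch of size-k pooling for one topic (illustrative; run format assumed).
# 'runs' maps each pooling system to its ranked list of doc ids for the topic.

def size_k_pool(runs, k=100):
    """Take documents round-robin, one rank deeper each pass, until k unique docs."""
    pool, depth = [], 0
    max_depth = max(len(ranking) for ranking in runs.values())
    while len(pool) < k and depth < max_depth:
        for ranking in runs.values():
            if depth < len(ranking) and ranking[depth] not in pool:
                pool.append(ranking[depth])
                if len(pool) == k:
                    break
        depth += 1
    return pool

# Toy example with two runs for one topic
runs = {"lucene": ["d1", "d2", "d3", "d4"], "lemur": ["d2", "d5", "d1", "d6"]}
print(size_k_pool(runs, k=5))  # ['d1', 'd2', 'd5', 'd3', 'd4']
```

The depth actually reached when the pool fills up is the "pool depth" reported later for the 2010 collection.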



Changes for the Course
• Pooling systems might leave relevant material out
  ▫ Include Google's top 10 results in each pool
     When choosing the topics we already knew some relevant results were there
• Students might do the judgments carelessly
  ▫ Crawl 2 noise topics with unrelated queries
  ▫ Add 10 noise documents to each pool and check (see the sketch below)
• Now keep pooling documents until size-100
  ▫ The union of the pools forms the biased collection
     This is the one we gave to the students
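
A minimal sketch of the noise-document check (illustrative, not the course's actual script; the data structures are assumed): flag any assessor who marks noise documents as relevant.

```python
# Minimal sketch of the noise-document quality check (data structures assumed).
# judgments_by_assessor: {assessor: {doc_id: relevance}}; noise_docs: ids crawled
# from the 2 unrelated noise topics and injected into every pool.

def flag_careless_assessors(judgments_by_assessor, noise_docs, max_allowed=0):
    """Return assessors who judged more than max_allowed noise documents as relevant."""
    flagged = {}
    for assessor, judgments in judgments_by_assessor.items():
        hits = sum(1 for doc, rel in judgments.items() if doc in noise_docs and rel > 0)
        if hits > max_allowed:
            flagged[assessor] = hits
    return flagged

# Toy example: one assessor marks a noise document as relevant
judgments = {"student01": {"d1": 2, "noise7": 1}, "student02": {"d1": 1, "noise7": 0}}
print(flag_careless_assessors(judgments, {"noise7"}))  # {'student01': 1}
```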



2010 Test Collection
• 32 students, 5 faculty

• Theme: Computing
  ▫ 20 topics (plus 2 for noise)
     17 judged by two people
  ▫ 9,769 documents (735MB) in complete collection
  ▫ 1,967 documents (161MB) in biased collection

• Pool sizes between 100 and 105
  ▫ Average pool depth 27.5, from 15 to 58



Evaluation
• With results and judgments, rank student systems

• Compute mean NDCG scores [Järvelin et al., 2002] (see the sketch below)
  ▫ Are relevant documents properly ranked?
  ▫ Ranges from 0 (useless) to 1 (perfect)

• The group with the best system is rewarded
  with an extra point (out of 10) in the final grade
  ▫ About half the students were really motivated by this
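
A minimal sketch of the NDCG computation referenced above, assuming the standard log2-discounted formulation and graded judgments on the 3-point scale (the exact gain and discount used in the course are not given in the slides):

```python
import math

# Minimal sketch of NDCG@k (standard log2 discount; exact course settings assumed).
# 'ranking' is a system's ranked list of doc ids; 'judgments' maps doc_id to graded
# relevance (0-2); documents outside the pool count as non-relevant (0).

def dcg(gains):
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(ranking, judgments, k=100):
    gains = [judgments.get(doc, 0) for doc in ranking[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if dcg(ideal) > 0 else 0.0

# Toy example: the most relevant document is ranked below a non-relevant one
print(round(ndcg(["d3", "d1", "d2"], {"d1": 2, "d2": 1}, k=3), 2))  # 0.67
```

Mean NDCG is then the average of these per-topic scores over the 20 topics.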



Is This Evaluation Method Reliable?
• These evaluations have problems [Voorhees, 2002]
  ▫ Inconsistency of judgments [Voorhees, 2000]
     People have different notions of «relevance»
     Same document judged differently
  ▫ Incompleteness of judgments [Zobel, 1998]
     Documents not in the pool are considered not relevant
     What if some of them were actually relevant?


• TREC ad hoc was quite reliable
  ▫ Many topics
  ▫ A great deal of manpower



Assessor Agreement
• 17 topics were judged by 2 people (see the sketch below)
  ▫ Mean Cohen's Kappa = 0.417
     From 0.096 to 0.735
  ▫ Mean Precision / Recall = 0.657 / 0.638 (one assessor judged against the other)
     From 0.111 to 1


• For TREC, observed 0.65 precision at 0.65 recall
 [Voorhees, 2000]



• Only 1 noise document was judged relevant
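
A minimal sketch of how the agreement figures above can be computed for one topic, assuming binary relevance over the documents judged by both assessors (Cohen's Kappa, plus precision/recall of one assessor against the other):

```python
# Minimal sketch of assessor agreement for one topic (illustrative; binary relevance assumed).
# j1, j2: {doc_id: True/False} judgments of the two assessors over the same pool.

def cohen_kappa(j1, j2):
    docs = sorted(set(j1) & set(j2))
    n = len(docs)
    observed = sum(1 for d in docs if j1[d] == j2[d]) / n       # observed agreement
    p1 = sum(j1[d] for d in docs) / n                           # P(relevant) for assessor 1
    p2 = sum(j2[d] for d in docs) / n                           # P(relevant) for assessor 2
    expected = p1 * p2 + (1 - p1) * (1 - p2)                    # agreement expected by chance
    return (observed - expected) / (1 - expected)

def precision_recall(j1, j2):
    """Precision and recall of assessor 1, taking assessor 2 as the gold standard."""
    rel1 = {d for d, r in j1.items() if r}
    rel2 = {d for d, r in j2.items() if r}
    overlap = len(rel1 & rel2)
    return overlap / len(rel1), overlap / len(rel2)

a = {"d1": True, "d2": False, "d3": True, "d4": False}
b = {"d1": True, "d2": False, "d3": False, "d4": False}
print(round(cohen_kappa(a, b), 2), precision_recall(a, b))  # 0.5 (0.5, 1.0)
```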

Inconsistency & System Performance
[Chart: mean NDCG@100 (x-axis, 0.2 to 0.8) of each student system, averaged over the 2,000 trels]


Inconsistency &
System Performance (and II)
• Take 2,000 random samples of assessors (see the sketch below)
  ▫ Each sample is a set of judgments that would have been possible with 1 assessor per topic
  ▫ NDCG differences between 0.075 and 0.146

• 1.5 times larger than in TREC [Voorhees, 2000]
  ▫ Inexperienced assessors
  ▫ 3-point relevance scale
  ▫ Larger values to begin with! (x2)
     Percentage differences are lower
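
A minimal sketch of the resampling procedure, under assumed data structures (each doubly-judged topic offers two possible judgment sets; `evaluate` stands for whatever computes a system's mean NDCG@100 from a trels set):

```python
import random

# Minimal sketch of the 2,000 random trels samples (illustrative; structures assumed).
# judgments[topic] is the list of judgment sets produced by that topic's 1 or 2 assessors.

def sample_trels(judgments, rng):
    """Pick one assessor at random per topic: one trels set that could have existed."""
    return {topic: rng.choice(assessors) for topic, assessors in judgments.items()}

def score_spread(system_runs, judgments, evaluate, samples=2000, seed=42):
    """Range of mean NDCG a system could get depending on who judged each topic."""
    rng = random.Random(seed)
    scores = [evaluate(system_runs, sample_trels(judgments, rng)) for _ in range(samples)]
    return min(scores), max(scores)

# Toy usage with a dummy evaluate() that just averages the relevance of retrieved docs
toy_judgments = {"T01": [{"d1": 1}, {"d1": 0}], "T02": [{"d2": 1}]}
toy_runs = {"T01": ["d1"], "T02": ["d2"]}
def toy_evaluate(runs, trels):
    return sum(trels[t].get(d, 0) for t in runs for d in runs[t]) / len(runs)
print(score_spread(toy_runs, toy_judgments, toy_evaluate, samples=10))  # e.g. (0.5, 1.0)
```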



Inconsistency & System Ranking
• Absolute numbers do not tell much, rankings do
  ▫ Absolute scores are highly dependent on the collection

• Compute rank correlations with the 2,000 random samples (see the sketch below)
  ▫ Mean Kendall's tau = 0.926
     From 0.811 to 1


• For TREC, observed 0.938 [Voorhees, 2000]

• No swap was significant (Wilcoxon, α=0.05)
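
A minimal sketch of Kendall's tau between two system rankings (illustrative; the scores below are made up). Tau counts concordant versus discordant system pairs, so identical orderings give 1 and a full reversal gives -1; scipy.stats.kendalltau and scipy.stats.wilcoxon could be used instead of hand-rolled code.

```python
from itertools import combinations

# Minimal sketch of Kendall's tau between two rankings of the same systems (no ties assumed).
# Each argument maps a system id to its mean NDCG under one trels set.

def kendall_tau(scores_a, scores_b):
    concordant = discordant = 0
    for s1, s2 in combinations(list(scores_a), 2):
        if (scores_a[s1] - scores_a[s2]) * (scores_b[s1] - scores_b[s2]) > 0:
            concordant += 1      # the pair is ordered the same way in both rankings
        else:
            discordant += 1      # the pair is swapped
    return (concordant - discordant) / (concordant + discordant)

official = {"A": 0.71, "B": 0.69, "C": 0.65, "D": 0.31}  # made-up scores
sampled = {"A": 0.68, "B": 0.70, "C": 0.63, "D": 0.30}
print(round(kendall_tau(official, sampled), 3))  # 0.667: one swapped pair out of six
```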


Incompleteness & System Performance
[Chart: increment in mean NDCG@100 (%) vs. pool size (30 to 100), showing the mean difference over the 2,000 trels and the estimated difference]


Incompleteness &
System Performance (and II)
• Start with size-20 pools and keep adding 5 documents at a time (see the sketch below)
  ▫ 20 to 25: NDCG variations between 14% and 37%
  ▫ 95 to 100: between 0.4% and 1.6%, mean 1%

• In TREC editions, between 0.5% and 3.5% [Zobel, 1998]
  ▫ Even some observations of 19%!

• We achieve these levels for sizes 60-65
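
A minimal sketch of this incremental analysis, where `evaluate(size)` stands for whatever computes mean NDCG@100 when only the first `size` documents of each pool are treated as judged (an assumption, not the course's actual script):

```python
# Minimal sketch of the incompleteness analysis (illustrative; evaluate() is assumed).

def ndcg_increments(evaluate, sizes=range(20, 101, 5)):
    """Relative change (%) in mean NDCG when growing the pools from one size to the next."""
    sizes = list(sizes)
    scores = [evaluate(size) for size in sizes]
    return {(sizes[i - 1], sizes[i]): 100 * abs(scores[i] - scores[i - 1]) / scores[i - 1]
            for i in range(1, len(sizes))}

# Toy usage with a dummy evaluate() whose scores saturate as pools grow
print(ndcg_increments(lambda size: 0.5 * (1 - 0.9 ** (size / 5)), sizes=[20, 25, 30]))
```

The increments shrink as the pools grow, which is the trend shown in the chart above.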



Incompleteness & Effort
• Extrapolate to larger pools
 ▫ If differences decrease a lot, judge a little more
 ▫ If not, judge fewer documents but more topics

• Pools of 135 documents would bring mean NDCG differences below 0.5%
 ▫ Not really worth the +35% judging effort



Student Perception
• Brings more work for both faculty and students

• More technical problems than in previous years
  ▫ Survey scores dropped from 3.27 to 2.82 out of 5
  ▫ Expected: they had to work considerably more
  ▫ Possible gaps in (our) CS program?

• Same satisfaction scores as in previous years
  ▫ Very encouraging for us
  ▫ Even though this was our first year, with lots of logistics problems



Conclusions & Future Work
• IR deserves more attention in CS education
• Laboratory experiments deserve it too

• Focus on the experimental nature of IR
 ▫ But give a wider perspective useful beyond IR

• Much more interesting and instructive

• Adapt well-known methodologies in
  industry/academia to our limited resources



Conclusions & Future Work (II)
• Improve the methodology
 ▫ Explore quality control techniques in crowdsourcing

• Integrate with other courses
 ▫ Computer Science
 ▫ Library Science

• Make everything public, free
 ▫ Test collections and reports
 ▫ http://ir.kr.inf.uc3m.es/eirex/



Conclusions & Future Work (and III)
• We call all this EIREX
 ▫ Information Retrieval Education
   through Experimentation

• We would like others to participate
 ▫ Use it in your class
    Collections
    The methodology
    Tools



2011 Semester
• Only 19 students

• The 2010 collection used for testing
 ▫ Every year there will be more variety to choose from

• 2011 collection: Crowdsourcing
 ▫ 23 topics, better described
    Created by the students
    Judgments by 1 student
 ▫ 13,245 documents (952MB) in complete collection
 ▫ 2,088 documents (96MB) in biased collection
• Overview report coming soon!




Can we reliably adapt the large-scale methodologies
 of academic/industrial IR evaluation workshops?

                  YES (we can)
