Chapter 1 : Overview of
Information Retrieval
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2024)
Information Retrieval
 Information retrieval (IR) is the process of finding relevant
documents that satisfy the information needs of users from large
collections of unstructured text.
 General Goal of Information Retrieval:
To help users find useful information based on their
information needs (with minimum effort), despite:
The increasing complexity of information,
The changing needs of users.
 Provide immediate random access to the document collection.
4/9/2024 2
Document Corpus
 Large collections of documents from various sources: news
articles, research papers, books, digital libraries, Web pages,
etc.
Sample Statistics of Text Collections:
 Dialog: (http://www.dialog.com/)
Claims to have more than 20 terabytes of data in > 600
databases, > 1 billion unique records.
 LEXIS/NEXIS: (http://www.lexisnexis.com/)
Claims 7 terabytes, 1.7 billion documents, 1.5 million
subscribers, 11,400 databases; > 200,000 searches per day; 9
mainframes, 300 Unix servers, 200 NT servers.
Document Corpus
 TREC (Text REtrieval Conference) collections:
An annual information retrieval conference & competition.
About 10 GB of text datasets for IR evaluation.
 Web Search Engines:
Google claims to index over 3 billion pages.
Information Retrieval Systems?
 Document (Web page) retrieval in
response to a query.
Quite effective (at some things)
Commercially successful (some of them)
 But what goes on behind the scenes?
How do they work?
What happens beyond the Web?
 Web search systems:
Lycos, Excite, Yahoo, Google, Live,
Northern Light, Teoma, HotBot, Baidu,
…
Web Search Engines
 There are more than 2,000 general web search engines.
 The big four are Google, Yahoo!, Live Search, Ask.
Scientific research & selected journals search engine: Scirus,
About.
Meta search engine: Search.com, Searchhippo, Searchthe.net,
Windseek, Web-search, Webcrawler, Mamma, Ixquick,
AllPlus, Fazzle, Jux2
Multimedia search engine: Blinkx
Visual search engine: Ujiko, Web Brain, RedZee, Kartoo,
Mooter
Audio/sound search engine: Feedster, Findsounds
Video search engine: YouTube, Trooker
Medical search engine: Search Medica, Healia,
Omnimedicalsearch
Web Search Engines
Index/Directory: Sunsteam, Supercrawler, Thunderstone,
Thenet1, Webworldindex, Smartlinks, Whatusee, Re-quest,
DMOZ, Searchtheweb
Others: Lycos, Excite, Altavista, AOL Search, Intute, Accoona,
Jayde, Hotbot, InfoMine, Slider, Selectsurf, Questfinder, Kazazz,
Answers, Factbites, Alltheweb
 There are also Virtual Libraries: Pinakes, WWW Virtual
Library, Digital-librarian, Librarians Internet Index.
Structure of an IR System
 An Information Retrieval System serves as a bridge between
the world of authors and the world of readers/users.
 Writers present a set of ideas in a document using a set of
concepts.
 Users then query the IR system for relevant documents that
satisfy their information need.
 What is in the Black Box?
 The black box is the processing part of the information retrieval
system.
[Figure: user → black box → documents]
Information Retrieval vs. Data
Retrieval
 An example of a data retrieval system is a relational database.
Data Retrieval vs. Information Retrieval:
Data organization: Structured (clear semantics: name, age, …) | Unstructured (no fields other than text)
Query language: Artificial, well-defined (e.g. SQL) | Free text ("natural language"), Boolean
Query specification: Complete | Incomplete
Items wanted: Exact matching | Partial & best matching, relevant items
Accuracy: 100% (results are always "correct") | < 50%
Error response: Sensitive | Insensitive
Typical IR Task
 Given:
 A corpus of document collections (text, image, video, audio)
published by various authors.
 A user information need in the form of a query.
 An IR system searches for:
 A ranked set of documents that are relevant to the user's
information need.
Typical IR System Architecture
Query string + document corpus → IR system → ranked documents:
1. Doc1
2. Doc2
3. Doc3
.
.
Web Search System
Web spider → document corpus
Query string → IR system → ranked documents:
1. Page1
2. Page2
3. Page3
.
.
Overview of the Retrieval Process
Issues that arise in IR
 Text representation:
 What makes a “good” representation? The use of free-text or content-bearing
index-terms?
 How is a representation generated from text?
 What are retrievable objects and how are they organized?
 Information needs representation:
 What is an appropriate query language?
 How can interactive query formulation and refinement be supported?
 Comparing representations:
 What is a “good” model of retrieval?
 How is uncertainty represented?
 Evaluating effectiveness of retrieval:
 What are good metrics?
 What constitutes a good experimental test bed?
Detail View of the Retrieval Process
User side: user need → user interface → text operations →
formulate query → searching (with user feedback looping back
into query formulation) → retrieved docs → ranking → ranked docs.
Text side: text database → text operations (logical view) →
indexing → index file (inverted file), which searching consults.
Focus in IR System Design
 Improving the retrieval effectiveness of the system:
 Effectiveness is evaluated in terms of precision, recall, …
 Stemming, stop word removal, weighting schemes, matching
algorithms.
 Improving efficiency. The concern here is:
 Storage space usage, access time, …
 Compression, data/file structures, space-time tradeoffs.
Subsystems of IR system
 The two subsystems of an IR system:
 Indexing:
 An offline process of organizing documents using keywords
extracted from the collection.
 Indexing is used to speed up access to desired information from
the document collection as per the user's query.
 Searching:
 An online process that scans the document corpus to find relevant
documents that match the user's query.
Statistical Properties of Text
 How is the frequency of different words distributed?
 A few words are very common.
 The 2 most frequent words (e.g. "the", "of") can account for
about 10% of word occurrences.
 Most words are very rare.
 Half the words in a corpus appear only once (so-called "read
only once" words).
 How fast does vocabulary size grow with the size of a corpus?
 Such factors affect the performance of IR system & can be used
to select suitable term weights & other aspects of the system.
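These distributional claims are easy to check empirically by counting word occurrences in a collection. A minimal sketch (the three sample sentences are invented for illustration):

```python
from collections import Counter

def word_frequencies(texts):
    """Count word occurrences across a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

# Tiny invented corpus, just to illustrate the skewed distribution.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird flew over the house",
]
freq = word_frequencies(corpus)

# A few words are very common: "the" alone dominates this sample.
print(freq.most_common(2))
# Most words are very rare: many appear only once ("read only once").
hapaxes = [w for w, c in freq.items() if c == 1]
print(len(hapaxes), "of", len(freq), "vocabulary words appear only once")
```

On realistic corpora the skew is far stronger: a handful of top words covers a large share of all occurrences (Zipf's law).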
Text Operations
 Not all words in a document are equally significant in representing
the contents/meaning of the document.
 Some words carry more meaning than others.
 Nouns are the most representative of a document's content.
 Therefore, the text of each document in a collection needs to be
preprocessed to select index terms.
 Text operations are the process of transforming text into a
logical representation.
 Text operations generate a set of index terms.
Text Operations
 Main operations for selecting index terms:
 Tokenization: identify a set of words used to describe the content
of text document.
 Stop words removal: filter out frequently appearing words.
 Stemming words: remove prefixes, infixes & suffixes.
 Design term categorization structures (like thesaurus), which
captures relationship for allowing the expansion of the original
query with related terms.
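The first three operations above can be sketched as a small pipeline (the stop list and suffix-stripping rules below are illustrative toy choices, not a real stemmer):

```python
import re

# Illustrative (not exhaustive) stop list.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are"}

def tokenize(text):
    """Identify the word tokens describing the text content."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Filter out frequently appearing, low-content words."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Crude suffix stripping; a real system would use e.g. a Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

text = "The users are searching the indexed documents"
index_terms = [naive_stem(t) for t in remove_stop_words(tokenize(text))]
print(index_terms)
```

The output terms ("user", "search", "index", "document") are what would actually be stored in the index, rather than the raw surface words.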
Indexing Subsystem
Documents → assign document identifier (document IDs)
Text → tokenize (tokens) → stop list (non-stop-list tokens) →
stemming & normalize (stemmed terms) → term weighting
(weighted terms) → index
Example: Indexing
Documents to be indexed: "Friends, Romans, countrymen."
Tokenizer → token stream: Friends | Romans | countrymen
Stemmer and normalizer → modified tokens: friend | roman |
countryman
Indexer → index file (inverted file):
friend → 2, 4
roman → 1, 2
countryman → 13, 16
Index File
 An index file consists of records, called index entries.
 Index files are much smaller than the original file.
For a 1 GB TREC text collection, the vocabulary has a size of
only 5 MB (Ref: Baeza-Yates and Ribeiro-Neto, 2005).
 This size may be further reduced by Linguistic pre-processing
(like stemming & other normalization methods).
 The usual unit for text indexing is a word.
 Index terms are used to look up records in a file.
 An index file usually stores its index terms in sorted order.
 The sort order of the terms in the index file provides an order on
the physical file.
Building Index file
 An index file of a document collection is a file consisting of a list
of index terms and, for each term, links to the documents that
contain it.
 A good index file maps each keyword Ki to the set of documents Di
that contain the keyword.
 An index file is a list of search terms organized for associative
look-up, i.e., to answer user queries:
 In which documents does a specified search term appear?
 Where within each document does each term appear?
 For organizing index file for a collection of documents, there are
various options available:
 Decide what data structure and/or file structure to use. Is it
sequential file, inverted file, suffix array, signature file, etc. ?
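Such a keyword-to-documents mapping (an inverted file) can be sketched in a few lines; the three tiny documents below are invented for illustration:

```python
def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it.
    docs: {doc_id: text}."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "information retrieval systems",
    2: "database systems",
    3: "retrieval of information from text",
}
index = build_inverted_index(docs)
print(index["retrieval"])   # documents containing "retrieval"
print(index["systems"])     # documents containing "systems"
```

Answering "in which documents does term t appear?" is then a single dictionary lookup; recording positions within each document would additionally require storing offsets in the posting lists.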
Searching Subsystem
Query → parse query (query tokens) → stop list (non-stop-list
tokens) → stemming & normalize (stemmed terms) → term
weighting (query terms)
Query terms + index terms (from the index file) → similarity
measure → relevant document set → ranking → ranked
document set
IR Models - Basic Concepts
One central problem regarding IR systems is the issue of
predicting which documents are relevant and which are not.
 Such a decision is usually dependent on a ranking algorithm
which attempts to establish a simple ordering of the documents
retrieved.
 Documents appearing at the top of this ordering are considered
more likely to be relevant.
Thus ranking algorithms are at the core of IR systems.
 The IR models determine the predictions of what is relevant and
what is not, based on the notion of relevance implemented by the
system.
IR Models - Basic Concepts
After preprocessing, N distinct terms remain; these
unique terms form the VOCABULARY.
Let ki be index term i and dj be document j.
Each term i in a document or query j is given a real-valued
weight, wij.
 wij is a weight associated with the pair (ki, dj). If wij = 0, the
term does not belong to document dj.
The weight wij quantifies the importance of the index term for
describing the document contents.
 vec(dj) = (w1j, w2j, …, wtj) is a weighted vector associated with
the document dj.
Mapping Documents & Queries
 Represent both documents & queries as N-dimensional vectors in
a term-document matrix, which shows occurrence of terms in the
document collection/query.
 E.g.
 An entry in the matrix corresponds to the “weight” of a term in
the document; zero means the term doesn’t exist in the document.
dj = (w1,j, w2,j, …, wN,j); qk = (w1,k, w2,k, …, wN,k)
T1 T2 …. TN
D1 w11 w12 … w1N
D2 w21 w22 … w2N
: : : :
: : : :
DM wM1 wM2 … wMN
 Document collection is mapped to
term-by-document matrix.
 View as vector in multidimensional
space.
 Nearby vectors are related.
IR Models: Matching function
 IR models measure the similarity between documents and
queries.
 Matching function is a mechanism used to match query with a set
of documents.
 For example, the vector space model considers documents and
queries as vectors in term-space and measures the similarity of the
document to the query.
 Techniques for matching include dot-product, cosine similarity,
dynamic programming…
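For instance, cosine similarity compares a document vector and a query vector by the angle between them. A minimal sketch (the term weights below are invented for illustration):

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(dw * qw for dw, qw in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

# Toy weights over a 4-term vocabulary.
doc1 = [1.0, 0.0, 2.0, 0.0]
doc2 = [0.0, 1.0, 0.0, 1.0]
query = [1.0, 0.0, 1.0, 0.0]

print(cosine_similarity(doc1, query))  # high: doc1 shares terms with the query
print(cosine_similarity(doc2, query))  # 0.0: no terms in common
```

Because the cosine normalizes by vector length, a long document is not favored over a short one merely for repeating many terms.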
IR Models
 A number of major models have been developed to retrieve
information:
 The Boolean model,
 The vector space model,
 The probabilistic model, and
 Other models.
 Boolean model: often referred to as the "exact match" model;
 The others are "best match" models.
The Boolean Model: Example
[Figure: documents d1–d8 distributed among the regions of index
terms k1, k2 and k3.]
Generate the relevant documents retrieved by the Boolean model
for the query:
 q = k1 ∧ (k2 ∨ k3)
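Under the exact-match Boolean model, a query such as k1 AND (k2 OR k3) reduces to set operations over each term's posting set. A minimal sketch (the document-to-term assignments below are invented, not read off the figure):

```python
# Hypothetical postings: which documents each index term appears in.
postings = {
    "k1": {"d1", "d2", "d4", "d5", "d6"},
    "k2": {"d3", "d4", "d5", "d7"},
    "k3": {"d5", "d6", "d7", "d8"},
}

# q = k1 AND (k2 OR k3), expressed with Python set operators.
result = postings["k1"] & (postings["k2"] | postings["k3"])
print(sorted(result))
```

A document either satisfies the Boolean expression or it does not; the model gives no degree of match, which is why exact-match retrieval cannot rank its results.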
IR System Evaluation?
Evaluation provides the ability to measure the differences
between IR systems.
 How well do our search engines work?
 Is system A better than B?
 Under what conditions?
Evaluation drives what to research:
 Identify techniques that work and do not work,
 There are many retrieval models/ algorithms/ systems.
 Which one is the best?
What is the best method for:
 Similarity measures (dot-product, cosine, …)
 Index term selection (stop-word removal, stemming…)
 Term weighting (TF, TF-IDF,…)
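As an illustration of the last point, the common TF-IDF scheme weights a term by its frequency in a document, discounted by how many documents contain it. One standard formulation (there are several variants) can be sketched as:

```python
import math

def tf_idf(term, doc_tokens, all_docs):
    """tf-idf weight of `term` in one document, given the whole collection.
    tf = raw count in the document; idf = log(N / df)."""
    tf = doc_tokens.count(term)
    df = sum(1 for d in all_docs if term in d)  # document frequency
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(all_docs) / df)

docs = [
    ["information", "retrieval", "retrieval"],
    ["database", "retrieval"],
    ["information", "theory"],
]
# "retrieval" occurs twice in doc 0 but appears in only 2 of 3 documents.
print(tf_idf("retrieval", docs[0], docs))
```

A term occurring in every document gets idf = log(1) = 0, so ubiquitous words contribute nothing to the ranking.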
Types of Evaluation Strategies
System-centered studies:
 Given documents, queries, and relevance judgments.
 Try several variations of the system.
 Measure which system returns the “best” hit list.
User-centered studies:
 Given several users, and at least two retrieval systems.
 Have each user try the same task on both systems.
 Measure which system best satisfies the users'
information needs.
Evaluation Criteria
What are some main measures for evaluating an IR system’s
performance?
Measure the effectiveness of the system:
 How capable is the system of retrieving relevant documents from
the collection?
 Is a system better than another one?
 User satisfaction: How “good” are the documents that are
returned as a response to user query?
 “Relevance” of results to meet information need of users.
Retrieval scenario
[Figure: six result lists (A–F), each showing 13 results retrieved by
a different search engine for a given query; relevant and irrelevant
documents are marked.]
Which search engine would you prefer? Why?
Measuring Retrieval Effectiveness
Metrics often used to evaluate the effectiveness of the system.
Recall:
 The percentage of relevant documents in the collection that are
retrieved in response to the user's query: A / (A + C).
Precision:
 The percentage of retrieved documents that are relevant to the
query: A / (A + B).
              Relevant   Irrelevant
Retrieved        A           B
Not retrieved    C           D
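Given the contingency table above, both metrics are one-line computations. A minimal sketch (the document IDs are invented for illustration):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from sets of document IDs."""
    a = len(retrieved & relevant)                       # relevant AND retrieved
    precision = a / len(retrieved) if retrieved else 0.0  # A / (A + B)
    recall = a / len(relevant) if relevant else 0.0       # A / (A + C)
    return precision, recall

retrieved = {1, 2, 3, 4}      # what the system returned
relevant = {2, 4, 5, 6, 7}    # what the user actually needed
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # A = 2, so P = 2/4 and R = 2/5
```

Note the tension between the two: returning the whole collection drives recall to 1 while ruining precision, which is why both are reported together.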
Query Language
How do users query?
 The basic IR approach is Keyword-based search.
 Queries are combinations of words.
The document collection is searched for documents that
contain these words.
Word queries are intuitive, easy to express and provide fast
ranking.
There are different query languages:
 Single-word queries,
 Multi-word queries,
 Boolean queries, etc.
Problems with Keywords
May not retrieve relevant documents that include Synonymous
terms (words with similar meaning).
 “restaurant” vs. “café”
 “Ethiopia” vs. “Abyssinia”
 “Car” vs. “automobile”
 “Buy” vs. “purchase”
 “Movie” vs. “film”
May retrieve irrelevant documents that include polysemous terms
(terms with multiple meanings).
 “Apple” (company vs. fruit)
 “Bit” (unit of data vs. act of eating)
 “Bat” (baseball vs. mammal)
 “Bank” (financial institution vs. river bank)
Relevance Feedback
After initial retrieval results are presented, allow the user to
provide feedback on the relevance of one or more of the
retrieved documents.
Use this feedback information to reformulate the query.
Produce new results based on reformulated query.
Allows a more interactive, multi-pass process.
Relevance feedback can be carried out in two ways:
 User relevance feedback,
 Pseudo-relevance feedback (fully automatic).
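A classic way to implement the reformulation step is the Rocchio method, which moves the query vector toward the relevant documents and away from the non-relevant ones. A minimal sketch (the α, β, γ values are conventional defaults, and all vectors are invented for illustration):

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio query reformulation over term-weight vectors (plain lists)."""
    n = len(query)

    def centroid(vectors):
        if not vectors:
            return [0.0] * n
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(n)]

    rel_c = centroid(relevant)
    nonrel_c = centroid(nonrelevant)
    # Move toward relevant docs, away from non-relevant ones (floor at 0).
    return [max(0.0, alpha * query[i] + beta * rel_c[i] - gamma * nonrel_c[i])
            for i in range(n)]

q = [1.0, 0.0, 0.0]                       # original query weights
rel = [[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]  # docs the user marked relevant
nonrel = [[1.0, 0.0, 0.0]]                # doc the user marked non-relevant
new_query = rocchio(q, rel, nonrel)
print(new_query)
```

Pseudo-relevance feedback runs the same computation but simply treats the top-ranked documents of the first pass as if the user had marked them relevant.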
User Relevance Feedback Architecture
Query string → IR system (over the document corpus) → ranked
documents:
1. Doc1
2. Doc2
3. Doc3
.
.
The user marks feedback on the rankings (e.g. Doc1 relevant,
Doc2 not relevant, Doc3 relevant) → query reformulation →
revised query → re-ranked documents:
1. Doc2
2. Doc1
3. Doc4
.
.
Challenges for IR researchers and
practitioners
Technical challenge: what tools should IR systems provide to
allow effective and efficient manipulation of information within
such diverse media of text, image, video and audio?
Interaction challenge: what features should IR systems provide
in order to support a wide variety of users in their search for
relevant information?
Evaluation challenge: how can we measure the effectiveness of
retrieval? Which tools and features are effective and usable,
given the increasing diversity of end-users and
information-seeking situations?
Assignments - One
Pick three of the following concepts (not taken by other
students). Review the literature (books, articles & the Internet)
concerning the meaning, function, pros and cons &
applications of each concept.
1. Information Retrieval
2. Search engine
3. Data retrieval
4. Cross language IR
5. Multilingual IR
6. Document image retrieval
7. Indexing
8. Tokenization
9. Stemming
10. Stop words
11. Normalization
12. Thesaurus
13. Searching
14. IR models
15. Term weighting
16. Similarity measurement
17. Retrieval effectiveness
18. Query language
19. Relevance feedback
20. Query Expansion
Question & Answer
Thank You !!!

Mais conteúdo relacionado

Semelhante a chapter 1-Overview of Information Retrieval.ppt

An Improved Mining Of Biomedical Data From Web Documents Using Clustering
An Improved Mining Of Biomedical Data From Web Documents Using ClusteringAn Improved Mining Of Biomedical Data From Web Documents Using Clustering
An Improved Mining Of Biomedical Data From Web Documents Using ClusteringKelly Lipiec
 
Information retrieval is the process of accessing data resources. Usually doc...
Information retrieval is the process of accessing data resources. Usually doc...Information retrieval is the process of accessing data resources. Usually doc...
Information retrieval is the process of accessing data resources. Usually doc...NALESVPMEngg
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notesAnandh Arumugakan
 
An Improved Annotation Based Summary Generation For Unstructured Data
An Improved Annotation Based Summary Generation For Unstructured DataAn Improved Annotation Based Summary Generation For Unstructured Data
An Improved Annotation Based Summary Generation For Unstructured DataMelinda Watson
 
Neuroinformatics_Databses_Ontologies_Federated Database.pptx
Neuroinformatics_Databses_Ontologies_Federated Database.pptxNeuroinformatics_Databses_Ontologies_Federated Database.pptx
Neuroinformatics_Databses_Ontologies_Federated Database.pptxJagannath University
 
Neuroinformatics Databases Ontologies Federated Database.pptx
Neuroinformatics Databases Ontologies Federated Database.pptxNeuroinformatics Databases Ontologies Federated Database.pptx
Neuroinformatics Databases Ontologies Federated Database.pptxJagannath University
 
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introductionnimmyjans4
 
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUESMULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUESijcseit
 
Chapter 1.pptx
Chapter 1.pptxChapter 1.pptx
Chapter 1.pptxHabtamu100
 
Inteligent Catalogue Final
Inteligent Catalogue FinalInteligent Catalogue Final
Inteligent Catalogue Finalguestcaef1d
 
OpenAthens Conference 2018 - Tim Lull and Chad Smith - Cultivating your onlin...
OpenAthens Conference 2018 - Tim Lull and Chad Smith - Cultivating your onlin...OpenAthens Conference 2018 - Tim Lull and Chad Smith - Cultivating your onlin...
OpenAthens Conference 2018 - Tim Lull and Chad Smith - Cultivating your onlin...OpenAthens
 
Information retrieval-systems notes
Information retrieval-systems notesInformation retrieval-systems notes
Information retrieval-systems notesBAIRAVI T
 
Text databases and information retrieval
Text databases and information retrievalText databases and information retrieval
Text databases and information retrievalunyil96
 
Context Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic WebContext Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic WebIOSR Journals
 
Enterprise Search Share Point2009 Best Practices Final
Enterprise Search Share Point2009 Best Practices FinalEnterprise Search Share Point2009 Best Practices Final
Enterprise Search Share Point2009 Best Practices FinalMarianne Sweeny
 
You have collected the following documents (unstructured) and pl.docx
You have collected the following documents (unstructured) and pl.docxYou have collected the following documents (unstructured) and pl.docx
You have collected the following documents (unstructured) and pl.docxbriancrawford30935
 
Web_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibWeb_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibEl Habib NFAOUI
 

Semelhante a chapter 1-Overview of Information Retrieval.ppt (20)

An Improved Mining Of Biomedical Data From Web Documents Using Clustering
An Improved Mining Of Biomedical Data From Web Documents Using ClusteringAn Improved Mining Of Biomedical Data From Web Documents Using Clustering
An Improved Mining Of Biomedical Data From Web Documents Using Clustering
 
Information retrieval is the process of accessing data resources. Usually doc...
Information retrieval is the process of accessing data resources. Usually doc...Information retrieval is the process of accessing data resources. Usually doc...
Information retrieval is the process of accessing data resources. Usually doc...
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
 
An Improved Annotation Based Summary Generation For Unstructured Data
An Improved Annotation Based Summary Generation For Unstructured DataAn Improved Annotation Based Summary Generation For Unstructured Data
An Improved Annotation Based Summary Generation For Unstructured Data
 
Hci
HciHci
Hci
 
Neuroinformatics_Databses_Ontologies_Federated Database.pptx
Neuroinformatics_Databses_Ontologies_Federated Database.pptxNeuroinformatics_Databses_Ontologies_Federated Database.pptx
Neuroinformatics_Databses_Ontologies_Federated Database.pptx
 
Neuroinformatics Databases Ontologies Federated Database.pptx
Neuroinformatics Databases Ontologies Federated Database.pptxNeuroinformatics Databases Ontologies Federated Database.pptx
Neuroinformatics Databases Ontologies Federated Database.pptx
 
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introduction
 
Research data life cycle
Research data life cycleResearch data life cycle
Research data life cycle
 
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUESMULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
 
Chapter 1.pptx
Chapter 1.pptxChapter 1.pptx
Chapter 1.pptx
 
Inteligent Catalogue Final
Inteligent Catalogue FinalInteligent Catalogue Final
Inteligent Catalogue Final
 
OpenAthens Conference 2018 - Tim Lull and Chad Smith - Cultivating your onlin...
OpenAthens Conference 2018 - Tim Lull and Chad Smith - Cultivating your onlin...OpenAthens Conference 2018 - Tim Lull and Chad Smith - Cultivating your onlin...
OpenAthens Conference 2018 - Tim Lull and Chad Smith - Cultivating your onlin...
 
Information retrieval-systems notes
Information retrieval-systems notesInformation retrieval-systems notes
Information retrieval-systems notes
 
Text databases and information retrieval
Text databases and information retrievalText databases and information retrieval
Text databases and information retrieval
 
Context Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic WebContext Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic Web
 
Enterprise Search Share Point2009 Best Practices Final
Enterprise Search Share Point2009 Best Practices FinalEnterprise Search Share Point2009 Best Practices Final
Enterprise Search Share Point2009 Best Practices Final
 
You have collected the following documents (unstructured) and pl.docx
You have collected the following documents (unstructured) and pl.docxYou have collected the following documents (unstructured) and pl.docx
You have collected the following documents (unstructured) and pl.docx
 
Project literature search
Project literature searchProject literature search
Project literature search
 
Web_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibWeb_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_Habib
 

Último

Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
Science lesson Moon for 4th quarter lesson
Science lesson Moon for 4th quarter lessonScience lesson Moon for 4th quarter lesson
Science lesson Moon for 4th quarter lessonJericReyAuditor
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxsocialsciencegdgrohi
 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxUnboundStockton
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfakmcokerachita
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfadityarao40181
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 

Último (20)

Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
Science lesson Moon for 4th quarter lesson
Science lesson Moon for 4th quarter lessonScience lesson Moon for 4th quarter lesson
Science lesson Moon for 4th quarter lesson
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...

chapter 1-Overview of Information Retrieval.ppt

 The big four are Google, Yahoo!, Live Search, Ask.
Scientific research & selected journals search engines: Scirus, About
Meta search engines: Search.com, Searchhippo, Searchthe.net, Windseek, Web-search, Webcrawler, Mamma, Ixquick, AllPlus, Fazzle, Jux2
Multimedia search engine: Blinkx
Visual search engines: Ujiko, Web Brain, RedZee, Kartoo, Mooter
Audio/sound search engines: Feedster, Findsounds
Video search engines: YouTube, Trooker
Medical search engines: Search Medica, Healia, Omnimedicalsearch
4/9/2024 6
Web Search Engines
 There are more than 2,000 general web search engines.
 The big four are Google, Yahoo!, Live Search, Ask.
Index/Directory: Sunsteam, Supercrawler, Thunderstone, Thenet1, Webworldindex, Smartlinks, Whatusee, Re-quest, DMOZ, Searchtheweb
Others: Lycos, Excite, Altavista, AOL Search, Intute, Accoona, Jayde, Hotbot, InfoMine, Slider, Selectsurf, Questfinder, Kazazz, Answers, Factbites, Alltheweb
 There are also Virtual Libraries: Pinakes, WWW Virtual Library, Digital-librarian, Librarians Internet Index.
4/9/2024 7
Structure of an IR System
 An Information Retrieval System serves as a bridge between the world of authors and the world of readers/users.
 Writers present a set of ideas in a document using a set of concepts.
 Users then ask the IR system for relevant documents that satisfy their information need.
 What is in the Black Box?
 The black box is the processing part of the information retrieval system.
[Figure: Documents -> Black box -> User]
4/9/2024 8
Information Retrieval vs. Data Retrieval
 An example of a data retrieval system is a relational database.

                       Data Retrieval                    Information Retrieval
  Data organization    Structured (clear semantics:      Unstructured (no fields
                       name, age, ...)                   other than text)
  Query language       Artificial (well defined, SQL)    Free text ("natural language"), Boolean
  Query specification  Complete                          Incomplete
  Items wanted         Exact matching                    Partial & best matching, relevant
  Accuracy             100% (results always "correct")   < 50%
  Error response       Sensitive                         Insensitive
4/9/2024 9
Typical IR Task
 Given:
 A corpus of document collections (text, image, video, audio) published by various authors.
 A user information need in the form of a query.
 An IR system searches for:
 A ranked set of documents that are relevant to satisfy the information need of a user.
4/9/2024 10
Typical IR System Architecture
[Figure: a query string and a document corpus are fed to the IR system, which returns ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]
4/9/2024 11
Web Search System
[Figure: a Web spider collects the document corpus; the IR system takes the query string and returns ranked pages: 1. Page1, 2. Page2, 3. Page3, ...]
4/9/2024 12
Overview of the Retrieval Process
4/9/2024 13
Issues that arise in IR
 Text representation:
 What makes a "good" representation? The use of free-text or content-bearing index-terms?
 How is a representation generated from text?
 What are retrievable objects and how are they organized?
 Information needs representation:
 What is an appropriate query language?
 How can interactive query formulation and refinement be supported?
 Comparing representations:
 What is a "good" model of retrieval?
 How is uncertainty represented?
 Evaluating effectiveness of retrieval:
 What are good metrics?
 What constitutes a good experimental test bed?
4/9/2024 14
Detail View of the Retrieval Process
[Figure: the user's need is formulated as a query through the user interface; text operations produce the logical view of both the query and the documents in the text database; indexing builds the inverted index file; searching retrieves matching documents, ranking orders them, and user feedback refines the query.]
4/9/2024 15
Focus in IR System Design
 Improving the retrieval effectiveness of the system.
 Effectiveness is evaluated in terms of precision, recall, ...
 Stemming, stop words, weighting schemes, matching algorithms.
 Improving the efficiency of the system. The concern here is:
 Storage space usage, access time, ...
 Compression, data/file structures, space-time tradeoffs.
4/9/2024 16
Subsystems of an IR System
 The two subsystems of an IR system:
 Indexing:
 An offline process of organizing documents using keywords extracted from the collection.
 Indexing is used to speed up access to the desired information in the document collection as per the user's query.
 Searching:
 An online process that scans the document corpus to find relevant documents that match the user's query.
4/9/2024 17
Statistical Properties of Text
 How is the frequency of different words distributed?
 A few words are very common.
 The 2 most frequent words (e.g. "the", "of") can account for about 10% of word occurrences.
 Most words are very rare.
 Half the words in a corpus appear only once (called "read only once").
 How fast does vocabulary size grow with the size of a corpus?
 Such factors affect the performance of an IR system & can be used to select suitable term weights & other aspects of the system.
4/9/2024 18
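The skew described above is easy to observe directly. A minimal sketch, using a small hypothetical corpus (not from the slides), counts word occurrences and shows that the top couple of words account for a disproportionate share of all tokens:

```python
from collections import Counter

# A tiny illustrative corpus (hypothetical text).
corpus = ("the cat sat on the mat and the dog sat by the door "
          "the cat and the dog slept").split()

counts = Counter(corpus)
total = len(corpus)

# Share of all occurrences taken by the 2 most frequent words.
top2 = sum(c for _, c in counts.most_common(2))
share = top2 / total

print(counts.most_common(1))  # the single most frequent word and its count
print(round(share, 2))
```

Even in this toy corpus the two most frequent words cover over 40% of occurrences; on realistic collections the distribution is similarly heavy-tailed.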
Text Operations
 Not all words in a document are equally significant in representing its contents/meaning.
 Some words carry more meaning than others.
 Nouns are the most representative of a document's content.
 Therefore, the text of the documents in a collection needs to be preprocessed to select index terms.
 Text operations is the process of transforming text into logical representations.
 Text operations generate a set of index terms.
4/9/2024 19
Text Operations
 Main operations for selecting index terms:
 Tokenization: identify the set of words used to describe the content of a text document.
 Stop word removal: filter out very frequently appearing words.
 Stemming: remove prefixes, infixes & suffixes.
 Design term categorization structures (like a thesaurus), which capture relationships that allow expanding the original query with related terms.
4/9/2024 20
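The operations above can be sketched as a minimal pipeline. This is illustrative only: the stop list and the suffix-stripping rules below are assumptions, not a real Porter stemmer.

```python
import re

# Hypothetical stop list; real systems use much longer ones.
STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in", "is"}

def tokenize(text):
    # Lowercase and split on runs of non-letter characters.
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def stem(token):
    # Crude suffix stripping, for illustration only.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    # Tokenize, drop stop words, then stem what remains.
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(index_terms("The friends of the Romans"))  # ['friend', 'roman']
```

The output terms ("friend", "roman") are the logical representation that the indexing subsystem would store.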
Indexing Subsystem
[Figure: documents are assigned document identifiers and tokenized; the tokens pass through the stop list, stemming & normalization, and term weighting; the weighted terms and document IDs are written to the index.]
4/9/2024 21
Example: Indexing
 Documents to be indexed: Friends, Romans, countrymen.
 Tokenizer -> token stream: Friends Romans countrymen
 Stemmer and normalizer -> modified tokens: friend roman countryman
 Indexer -> index file (inverted file):
 friend -> 2, 4
 roman -> 1, 2
 countryman -> 13, 16
4/9/2024 22
Index File
 An index file consists of records, called index entries.
 Index files are much smaller than the original file.
 For 1 GB of TREC text collection the vocabulary has a size of only 5 MB (Ref: Baeza-Yates and Ribeiro-Neto, 2005).
 This size may be further reduced by linguistic pre-processing (like stemming & other normalization methods).
 The usual unit for text indexing is a word.
 Index terms are used to look up records in a file.
 An index file usually keeps its index terms in sorted order.
 The sort order of the terms in the index file provides an order on the physical file.
4/9/2024 23
Building the Index File
 An index file of a document collection consists of a list of index terms and, for each term, a link to one or more documents that contain it.
 A good index file maps each keyword Ki to the set of documents Di that contain the keyword.
 An index file is a list of search terms organized for associative look-up, i.e., to answer a user's query:
 In which documents does a specified search term appear?
 Where within each document does each term appear?
 For organizing the index file for a collection of documents, various options are available:
 Decide what data structure and/or file structure to use: a sequential file, inverted file, suffix array, signature file, etc.
4/9/2024 24
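An inverted file of the kind described above can be built in a few lines. The documents and IDs here are hypothetical; the point is the keyword-to-document mapping:

```python
from collections import defaultdict

# Hypothetical, already-preprocessed documents keyed by document ID.
docs = {
    1: "friends romans countrymen",
    2: "romans go home",
    3: "friends of caesar",
}

# Map each index term to the set of document IDs containing it.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term].add(doc_id)

# Store postings as sorted lists, as in a real inverted file.
index = {term: sorted(ids) for term, ids in inverted.items()}

# Associative look-up: in which documents does "romans" appear?
print(index["romans"])  # [1, 2]
```

A production index would also store within-document positions (to answer "where within each document") and term weights, but the core structure is this term-to-postings map.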
Searching Subsystem
[Figure: the query is parsed into tokens, passed through the stop list and stemming & normalization, and term-weighted; the resulting query terms are matched against the index terms in the index file by a similarity measure, and the relevant document set is ranked.]
4/9/2024 25
IR Models - Basic Concepts
 One central problem regarding IR systems is the issue of predicting which documents are relevant and which are not.
 Such a decision usually depends on a ranking algorithm which attempts to establish a simple ordering of the retrieved documents.
 Documents appearing at the top of this ordering are considered more likely to be relevant. Thus ranking algorithms are at the core of IR systems.
 The IR model determines the prediction of what is relevant and what is not, based on the notion of relevance implemented by the system.
4/9/2024 26
IR Models - Basic Concepts
 After preprocessing, N distinct terms remain; these unique terms form the VOCABULARY.
 Let ki be an index term i & dj be a document j.
 Each term i in a document or query j is given a real-valued weight, wij.
 wij is the weight associated with (ki, dj). If wij = 0, the term does not belong to document dj.
 The weight wij quantifies the importance of the index term for describing the document contents.
 vec(dj) = (w1j, w2j, ..., wtj) is the weighted vector associated with the document dj.
4/9/2024 27
Mapping Documents & Queries
 Represent both documents & queries as N-dimensional vectors in a term-document matrix, which shows the occurrence of terms in the document collection/query:

  dj = (t1,j, t2,j, ..., tN,j);  qk = (t1,k, t2,k, ..., tN,k)

 An entry in the matrix corresponds to the "weight" of a term in the document; zero means the term doesn't exist in the document.

         T1    T2   ...  TN
  D1     w11   w12  ...  w1N
  D2     w21   w22  ...  w2N
  :      :     :         :
  DM     wM1   wM2  ...  wMN

 The document collection is mapped to a term-by-document matrix.
 View each document as a vector in a multidimensional space.
 Nearby vectors are related.
4/9/2024 28
IR Models: Matching Function
 IR models measure the similarity between documents and queries.
 A matching function is the mechanism used to match a query against a set of documents.
 For example, the vector space model considers documents and queries as vectors in term space and measures the similarity of the document to the query.
 Techniques for matching include the dot product, cosine similarity, dynamic programming, ...
4/9/2024 29
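Cosine similarity, the matching function named above, can be written directly from its definition: the dot product of the two vectors divided by the product of their lengths. The term weights below are hypothetical:

```python
import math

def cosine(d, q):
    # Dot product over the product of the vector norms; 0 if either is empty.
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

d1 = [1, 2, 0]  # hypothetical weights of terms T1..T3 in document d1
q  = [1, 1, 0]  # weights of the same terms in the query

print(round(cosine(d1, q), 3))  # 0.949
```

Because the measure is normalized by vector length, a long document is not favored over a short one merely for repeating terms, which is one reason cosine is preferred over the raw dot product.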
IR Models
 A number of major models have been developed to retrieve information:
 The Boolean model,
 The vector space model,
 The probabilistic model, and
 Other models.
 The Boolean model is often referred to as the "exact match" model;
 The others are "best match" models.
4/9/2024 30
The Boolean Model: Example
[Figure: a Venn diagram of index terms k1, k2, k3 over documents d1 ... d8.]
 Generate the relevant documents retrieved by the Boolean model for the query:
 q = k1 ∧ (k2 ∨ ¬k3)
4/9/2024 31
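Exact-match Boolean retrieval reduces to set operations over postings. The slide's operators are garbled in this copy, so the sketch below assumes the classic form q = k1 AND (k2 OR NOT k3), and the document-to-term assignments are hypothetical (the original Venn diagram is not recoverable):

```python
# Hypothetical postings: which of documents d1..d8 contain each index term.
ALL_DOCS = {f"d{i}" for i in range(1, 9)}
postings = {
    "k1": {"d1", "d2", "d4", "d5", "d6"},
    "k2": {"d3", "d4", "d5", "d7"},
    "k3": {"d5", "d6", "d7", "d8"},
}

# q = k1 AND (k2 OR NOT k3), evaluated with set intersection/union/difference.
result = postings["k1"] & (postings["k2"] | (ALL_DOCS - postings["k3"]))
print(sorted(result))  # ['d1', 'd2', 'd4', 'd5']
```

Note that the answer is an unranked set: every document either matches the formula exactly or it does not, which is why the Boolean model is called "exact match".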
IR System Evaluation
 Evaluation provides the ability to measure the difference between IR systems.
 How well do our search engines work?
 Is system A better than B?
 Under what conditions?
 Evaluation drives what to research:
 Identify techniques that work and those that do not,
 There are many retrieval models/algorithms/systems; which one is the best?
 What is the best method for:
 Similarity measures (dot product, cosine, ...)
 Index term selection (stop-word removal, stemming, ...)
 Term weighting (TF, TF-IDF, ...)
4/9/2024 32
Types of Evaluation Strategies
 System-centered studies:
 Given documents, queries, and relevance judgments,
 Try several variations of the system,
 Measure which system returns the "best" hit list.
 User-centered studies:
 Given several users, and at least two retrieval systems,
 Have each user try the same task on both systems,
 Measure which system best satisfies the users' information need.
4/9/2024 33
Evaluation Criteria
 What are the main measures for evaluating an IR system's performance?
 Measure the effectiveness of the system:
 How capable is the system of retrieving relevant documents from the collection?
 Is one system better than another?
 User satisfaction: how "good" are the documents that are returned as a response to a user query?
 "Relevance" of results to the information need of users.
4/9/2024 34
Retrieval Scenario
[Figure: six result lists A-F, each showing 13 results retrieved by a different search engine for a given query, with relevant and irrelevant documents marked.]
 Which search engine do you prefer? Why?
4/9/2024 35
Measuring Retrieval Effectiveness
 Metrics often used to evaluate the effectiveness of the system:

                  Relevant   Irrelevant
  Retrieved       A          B
  Not retrieved   C          D

 Recall:
 The percentage of relevant documents retrieved from the database in response to a user's query: A / (A + C)
 Precision:
 The percentage of retrieved documents that are relevant to the query: A / (A + B)
4/9/2024 36
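The two measures follow directly from the contingency table. A short worked example with hypothetical counts:

```python
# Contingency-table counts (hypothetical):
# A = relevant & retrieved, B = irrelevant & retrieved,
# C = relevant & not retrieved, D = irrelevant & not retrieved.
A, B, C, D = 8, 4, 2, 86

recall = A / (A + C)       # fraction of all relevant docs that were retrieved
precision = A / (A + B)    # fraction of retrieved docs that are relevant

print(recall)     # 0.8
print(precision)  # ~0.667
```

With these numbers the system found 8 of the 10 relevant documents (recall 0.8), but only 8 of its 12 answers were relevant (precision about 0.667), illustrating that the two measures capture different failure modes.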
Query Language
 How do users query?
 The basic IR approach is keyword-based search.
 Queries are combinations of words. The document collection is searched for documents that contain these words.
 Word queries are intuitive, easy to express and allow fast ranking.
 There are different query languages:
 Single-word query,
 Multiple-word query,
 Boolean query, ... etc.
4/9/2024 37
Problems with Keywords
 May not retrieve relevant documents that use synonymous terms (words with similar meaning):
 "restaurant" vs. "café"
 "Ethiopia" vs. "Abyssinia"
 "car" vs. "automobile"
 "buy" vs. "purchase"
 "movie" vs. "film"
 May retrieve irrelevant documents that include polysemous terms (terms with multiple meanings):
 "Apple" (company vs. fruit)
 "bit" (unit of data vs. past tense of "bite")
 "bat" (baseball vs. mammal)
 "bank" (financial institution vs. river bank)
4/9/2024 38
Relevance Feedback
 After the initial retrieval results are presented, allow the user to provide feedback on the relevance of one or more of the retrieved documents.
 Use this feedback information to reformulate the query.
 Produce new results based on the reformulated query.
 This allows a more interactive, multi-pass process.
 Relevance feedback can be automated in such a way that it allows:
 User relevance feedback,
 Pseudo relevance feedback.
4/9/2024 39
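The slides do not prescribe a particular reformulation formula; one classical choice is Rocchio's method, which moves the query vector toward the centroid of documents judged relevant and away from the centroid of those judged non-relevant. The weights ALPHA/BETA/GAMMA below are common but assumed values:

```python
ALPHA, BETA, GAMMA = 1.0, 0.75, 0.15  # assumed Rocchio parameters

def rocchio(query, relevant, nonrelevant):
    """New query = ALPHA*q + BETA*centroid(relevant) - GAMMA*centroid(nonrelevant),
    with negative weights clipped to zero."""
    n = len(query)
    rel_c = [sum(d[i] for d in relevant) / len(relevant) for i in range(n)]
    non_c = [sum(d[i] for d in nonrelevant) / len(nonrelevant) for i in range(n)]
    return [max(0.0, ALPHA * query[i] + BETA * rel_c[i] - GAMMA * non_c[i])
            for i in range(n)]

# Hypothetical 3-term weight vectors.
q = [1.0, 0.0, 1.0]
rel = [[1.0, 1.0, 0.0]]   # one document marked relevant
non = [[0.0, 0.0, 1.0]]   # one document marked non-relevant

print(rocchio(q, rel, non))  # [1.75, 0.75, 0.85]
```

Term 2 gains weight because it appears in the relevant document, while term 3 is dampened by the non-relevant one; rerunning the search with this revised query is the second pass of the feedback loop.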
User Relevance Feedback Architecture
[Figure: the IR system returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...) for the query string; the user marks each retrieved document as relevant or not; query reformulation produces a revised query, and the system returns re-ranked documents (1. Doc2, 2. Doc1, 3. Doc4, ...).]
4/9/2024 40
Challenges for IR Researchers and Practitioners
 Technical challenge: what tools should IR systems provide to allow effective and efficient manipulation of information within such diverse media as text, image, video and audio?
 Interaction challenge: what features should IR systems provide in order to support a wide variety of users in their search for relevant information?
 Evaluation challenge: how can we measure the effectiveness of retrieval? Which tools and features are effective and usable, given the increasing diversity of end-users and information-seeking situations?
4/9/2024 41
Assignments - One
 Pick three of the following concepts (not taken by other students).
 Review the literature (books, articles & Internet) concerning the meaning, function, pros and cons & application of each concept.
 1. Information Retrieval
 2. Search engine
 3. Data retrieval
 4. Cross-language IR
 5. Multilingual IR
 6. Document image retrieval
 7. Indexing
 8. Tokenization
 9. Stemming
 10. Stop words
 11. Normalization
 12. Thesaurus
 13. Searching
 14. IR models
 15. Term weighting
 16. Similarity measurement
 17. Retrieval effectiveness
 18. Query language
 19. Relevance feedback
 20. Query Expansion
4/9/2024 42