This document provides an overview of text retrieval and search engine concepts. Key challenges discussed include semantics and specificity in queries, and an expert search engine is used as a case study. It describes the main components of text retrieval: document representation, indexing, inverted indexing, retrieval functions, and evaluation metrics.
8. Types of Search Engines
Web Search Engines
Google, Yahoo, Bing
Domain-Specific Search Engines
Medline/PubMed
Microsoft Academic
Desktop Search Engines
Copernic
9. Connecting Two Ends
Search connects two ends: information needs and a search collection.
Search collections: web, domain-specific, personal, enterprise, etc.
Example information needs:
"I want to know more about the keynote speech of ICAICTA 2016."
"I need more Pokeballs. Free of charge..."
"What's so funny about FuYuan Hui??"
"Scholarship ending soon, three months left to submit my thesis..."
Collection contents: web sites, journal articles, news, images, videos, audio, scanned documents, tweets, posts, reviews, etc.
10. A Conceptual Model for Text Retrieval
[Diagram] Information needs are turned into a query (formulation); the search collection is converted into a document representation (indexing, natural language content analysis); a retrieval function matches the query against the representation to produce the retrieved documents; relevance feedback feeds results back into query formulation.
12. Search Collection (Retrieval Unit)
Web pages, email, books, news stories, scholarly papers, text messages, Word™, PowerPoint™, PDF, forum postings, patents, etc.
The retrieval unit can be
a part of a document, e.g. a paragraph, a slide, a page, etc.
in different structures/formats: HTML, XML, plain text, etc.
of different sizes/lengths.
13. Document Representation
Full-Text Representation
Keep everything. Complete.
Requires huge resources. Too much may not be good.
Reduced (Partial) Content Representation
Remove unimportant content, e.g. stopwords.
Standardize to reduce overlapping content, e.g. stemming.
Retain only important content, e.g. noun phrases, headers, etc.
14. Document Representation
Think of representation as a way of storing the document.
Bag-of-Words Model
Store the words as a bag (multiset), disregarding grammar and even word order.
Document 1: "The cat sat on the hat"
Document 2: "The dog ate the cat and the hat"
From these two documents, a word list is constructed:
{ the, cat, sat, on, hat, dog, ate, and }
The list has 8 distinct words.
Document 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Document 2: { 3, 1, 0, 0, 1, 1, 1, 1 }
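The bag-of-words construction above can be sketched in a few lines of Python (a minimal illustration; the function name `bag_of_words` is ours, not from the slides):

```python
from collections import Counter

def bag_of_words(text, vocab):
    # Count how often each vocabulary word occurs, ignoring grammar and order.
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

vocab = ["the", "cat", "sat", "on", "hat", "dog", "ate", "and"]
d1 = bag_of_words("The cat sat on the hat", vocab)
d2 = bag_of_words("The dog ate the cat and the hat", vocab)
print(d1)  # [2, 1, 1, 1, 1, 0, 0, 0]
print(d2)  # [3, 1, 0, 0, 1, 1, 1, 1]
```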
15. Information Needs & Query
Information Need != Query
Recall the information needs from earlier:
Query: icaicta 2016 keynote
Information Need: I want to know more about the keynote speech of ICAICTA 2016.
Query: free pokeball
Information Need: I need more Pokeballs. I don't want to pay. No cheat codes.
16. Retrieved Documents
From the original collection, a subset of documents is obtained.
What factor determines which documents to return?
Simple Term Matching Approach
1. Compare the terms in each document and the query.
2. Compute a "similarity" between each document in the collection and the query based on the terms they have in common.
3. Sort the documents in order of decreasing similarity with the query.
4. Display the ranked list to the user; the top documents are the most relevant as judged by the system.
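The four steps above can be sketched as follows, assuming the simplest possible similarity (count of shared distinct terms); the names and documents are illustrative:

```python
def similarity(doc_terms, query_terms):
    # Steps 1-2: similarity = number of distinct terms shared with the query.
    return len(set(doc_terms) & set(query_terms))

docs = {"D1": ["new", "straits", "times"], "D2": ["north", "borneo", "times"]}
query = ["straits", "times"]
# Steps 3-4: sort by decreasing similarity and present the ranked list.
ranked = sorted(docs, key=lambda d: similarity(docs[d], query), reverse=True)
print(ranked)  # ['D1', 'D2']
```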
17. Indexing
Convert documents into a representation or data structure that improves the efficiency of retrieval.
The goal is to generate a set of useful terms called indexes.
Why? A wide variety of words is used in texts, but not all are important. Among the important words, some are more contextually relevant than others.
Some basic processes involved:
• Tokenization
• Stop Words Removal
• Stemming
• Phrases
• Inverted File
18. Indexing (Tokenization)
Convert a sequence of characters into a sequence of tokens with some basic meaning.
"The cat chases the mouse." → the | cat | chases | the | mouse
"Bigcorp's 2007 bi-annual report showed profits rose 10%." → bigcorp | 2007 | bi | annual | report | showed | profits | rose | 10%
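A simplistic tokenizer reproducing the behavior shown above (dropping the possessive 's and keeping % attached to numbers); real tokenizers handle many more cases, as the following slides discuss:

```python
import re

def tokenize(text):
    text = text.lower()
    text = re.sub(r"'s\b", "", text)         # drop possessive 's (bigcorp's -> bigcorp)
    return re.findall(r"[a-z0-9]+%?", text)  # split on punctuation, keep trailing %

print(tokenize("Bigcorp's 2007 bi-annual report showed profits rose 10%."))
# ['bigcorp', '2007', 'bi', 'annual', 'report', 'showed', 'profits', 'rose', '10%']
```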
19. Indexing (Tokenization)
A token can be a single term or multiple terms.
"Samsung Galaxy S7 Edge, redefines what a phone can do."
As multi-term tokens: samsung galaxy s7 edge | redefines | what | a | phone | can | do
Or as single terms: samsung | galaxy | s7 | edge | redefines | what | a | ...
20. Indexing (Tokenization)
Common Issues
1. Capitalized words can have different meanings from lowercase words
Bush fires the officer. Query: Bush fire
The bush fire lasted for 3 days. Query: bush fire
2. Apostrophes can be a part of a word, a part of a possessive, or just a
mistake
rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's
degree, england's ten largest cities, shriner's
21. Indexing (Tokenization)
3. Numbers can be important, including decimals
nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the
beat, 288358
4. Periods can occur in numbers, abbreviations, URLs, ends of
sentences, and other situations
I.B.M., Ph.D., cs.umass.edu, F.E.A.R.
Note: tokenizing steps for queries must be identical to steps for
documents
22. Indexing (Stopping)
[Figure: top 50 words from the AP89 news collection]
Recall, indexes should be useful term links to a document.
Are the high-frequency terms in the figure useful?
23. Indexing (Stopping)
Stopword list can be created from high-frequency words or based
on a standard list
Lists are customized for applications, domains, and even parts of
documents
e.g., “click” is a good stopword for anchor text
The best policy is to index all words in the documents and decide which words to use at query time.
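Stopword removal is a simple filter over the token stream; the list here is a tiny illustrative sample, not a standard stopword list:

```python
STOPWORDS = {"the", "a", "an", "and", "on", "of", "in", "to"}

def remove_stopwords(tokens):
    # Keep only terms that are not on the stopword list.
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "dog", "ate", "the", "cat", "and", "the", "hat"]))
# ['dog', 'ate', 'cat', 'hat']
```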
24. Indexing (Stemming)
Many morphological variations of words
inflectional (plurals, tenses)
derivational (making verbs nouns etc.)
In most cases, these have the same or very similar meanings
Stemmers attempt to reduce morphological variations of words
to a common stem
usually involves removing suffixes
Can be done at indexing time or as part of query processing (like
stopwords)
25. Indexing (Stemming)
Porter Stemmer
An algorithmic stemmer used in IR experiments since the 70s.
Consists of a series of rules designed to remove the longest possible suffix at each step.
Produces stems, not words.
[Figure: example rules from Step 1]
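A toy suffix stripper in the spirit of Porter's Step 1a rules gives the flavor of the algorithm; this is a sketch of the idea, not the real stemmer, which applies ordered rule sets with conditions on the remaining stem:

```python
def simple_stem(word):
    # Apply the longest matching suffix rule (Porter Step 1a, toy version).
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print(simple_stem("caresses"))  # caress
print(simple_stem("ponies"))   # poni  (a stem, not a word)
print(simple_stem("caress"))   # caress
print(simple_stem("cats"))     # cat
```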
26. Indexing (Phrases)
Recall: meaningful tokens make better indexes, e.g. phrases.
Text processing issue – how are phrases recognized?
Three possible approaches:
Identify syntactic phrases using a part-of-speech (POS) tagger
Use word n-grams
Store word positions in indexes and use proximity operators in
queries
28. Indexing (Inverted Index)
Recall, indexes are designed to support search.
Each index term is associated with an inverted list
Contains lists of documents, or lists of word occurrences in documents, and
other information.
Each entry is called a posting.
The part of the posting that refers to a specific document or location
is called a pointer
Each document in the collection is given a unique number
Lists are usually document-ordered (sorted by document number)
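A minimal inverted index mapping each term to a document-ordered posting list; here each posting is just a document number, while real systems also store positions, counts, and other information:

```python
from collections import defaultdict

def build_inverted_index(docs):
    # docs: {doc_number: text}. Returns {term: sorted list of doc numbers}.
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {1: "new straits times", 2: "new straits daily", 3: "north borneo times"}
index = build_inverted_index(docs)
print(index["times"])  # [1, 3]
print(index["new"])    # [1, 2]
```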
33. Retrieval Function
Ranking
Documents are retrieved in sorted order according to a score computed using the document representation, the query, and a ranking algorithm.
34. Retrieval Function (Vector Space Model)
Ranking-based method.
Documents and query represented by a vector of term
weights.
Collection represented by a matrix of term weights.
35. Retrieval Function (Vector Space Model)
        borneo  daily  new  north  straits  times
    D1    0       0     1     0       1       1
    D2    0       1     1     0       1       0
    D3    1       0     0     1       0       1
D1: new straits times
D2: new straits daily
D3: north borneo times
The columns form the vector of useful terms.
36. Retrieval Function (Vector Space Model)
        borneo  daily  new    north  straits  times
    D1  0       0      0.176  0      0.176    0.176
    D2  0       0.477  0.176  0      0.176    0
    D3  0.477   0      0      0.477  0        0.176
idf(borneo) = log(3/1) = 0.477
idf(daily) = log(3/1) = 0.477
idf(new) = log(3/2) = 0.176
idf(north) = log(3/1) = 0.477
idf(straits) = log(3/2) = 0.176
idf(times) = log(3/2) = 0.176
Multiplying each idf by the term frequency gives the tf.idf weight.
Term frequency (tf) measures importance in the document; inverse document frequency (idf) measures importance in the collection.
Note: basic tf.idf does not account for document length, term location, or term semantic meaning.
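The weights in the table can be reproduced with raw term counts and idf = log10(N/df); a minimal sketch, with function names of our choosing:

```python
import math

def tfidf(docs):
    # docs: list of token lists. idf = log10(N / document frequency).
    N = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(t in d for d in docs) for t in vocab}
    idf = {t: math.log10(N / df[t]) for t in vocab}
    return vocab, [[d.count(t) * idf[t] for t in vocab] for d in docs]

docs = [["new", "straits", "times"],
        ["new", "straits", "daily"],
        ["north", "borneo", "times"]]
vocab, weights = tfidf(docs)
print(round(weights[0][vocab.index("new")], 3))     # 0.176
print(round(weights[2][vocab.index("borneo")], 3))  # 0.477
```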
37. Retrieval Function (Vector Space Model)
Documents ranked by distance between points
representing query and documents
Similarity measure more common than a distance or dissimilarity
measure
e.g. Cosine correlation
38. Retrieval Function (Vector Space Model)
Consider two documents D1, D2 and a query Q
Q = “straits times”
Compare against collection, D1 = “new straits times”
(borneo, daily, new, north, straits, times)
Q = (0, 0, 0, 0, 0.176, 0.176)
D1 = (0, 0, 0.176, 0, 0.176, 0.176)
D2 = (0, 0.477, 0.176, 0, 0.176, 0)
Cosine(D1, Q) = (0·0 + 0·0 + 0.176·0 + 0·0 + 0.176·0.176 + 0.176·0.176) / (√(0.176² + 0.176² + 0.176²) · √(0.176² + 0.176²)) = 0.816
Exercise: find Cosine(D2, Q). Which document is more relevant?
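The cosine computation in code, using the vectors above; running it for D2 checks your answer to the exercise:

```python
import math

def cosine(u, v):
    # Cosine of the angle between two term-weight vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

Q  = [0, 0,     0,     0, 0.176, 0.176]
D1 = [0, 0,     0.176, 0, 0.176, 0.176]
D2 = [0, 0.477, 0.176, 0, 0.176, 0]
print(round(cosine(D1, Q), 3))  # 0.816
print(round(cosine(D2, Q), 3))
```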
39. Evaluation
Evaluating the retrieval function, preprocessing steps, etc. is a must.
Standard Collection
Task-specific.
Human experts are used to judge relevant results.
Performance Metrics
Precision
Recall
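Precision and recall in code (a sketch; the retrieved list and relevance judgments below are hypothetical):

```python
def precision_recall(retrieved, relevant):
    # Precision: fraction of retrieved documents that are relevant.
    # Recall: fraction of relevant documents that were retrieved.
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

p, r = precision_recall(retrieved=["D1", "D2", "D4"], relevant=["D1", "D3"])
# Only D1 is a relevant hit: precision = 1/3, recall = 1/2.
```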