Slides for VU Web Technology course lecture on "Search on the Web". Explaining how search engines work, some basic information laws and inverted indices.
1. Search on the Web
Victor de Boer
Web Technology 2015
Slides adapted from
Willem Robert van Hage
2. Overview
• Search engines:
– What do they do?
– How do they work?
– How good are they? How to evaluate?
• Discover information laws by counting words
3. How does a search engine work?
• What is a search engine?
8. How does a search engine work?
• How does a search engine know a document
matches your question?
• Words have different meanings; does the
search engine know which one you need?
9. How does a search engine work?
In fact, most search engines do not know what you
mean; they just make a guess. If they read “bank”,
they do not know whether you mean a river bank or a
financial institution.
They usually return the pages that make the majority
of the users happy.
10. How does a search engine work?
• So if you enter “bank”, the search engine does
not necessarily know what you mean.
• But what if you enter “bank transfer”?
11. How does a search engine work?
Then the search engine still does not “know” what
you mean, but will just return pages that mention
both “bank” and “transfer”. If these correspond
with what you meant, is that a “mere coincidence”?
Not entirely, because the word “transfer” in
combination with “bank” makes the query more
informative than either word on its own.
Boolean search, ad-hoc query
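A minimal sketch of such a Boolean AND query, assuming three made-up pages (the page names and texts are hypothetical). It only checks whether every query word occurs on a page; nothing in it "understands" the query.

```python
# Hypothetical mini-collection: page name -> page text.
pages = {
    "page1": "open a bank account and make a bank transfer online",
    "page2": "we walked along the river bank",
    "page3": "the football transfer window closes today",
}

def boolean_and(query, pages):
    """Return the pages that contain every word of the query."""
    terms = query.lower().split()
    return [
        name
        for name, text in pages.items()
        if all(term in text.lower().split() for term in terms)
    ]

print(boolean_and("bank", pages))           # both senses of "bank" match
print(boolean_and("bank transfer", pages))  # the second word narrows the result
```

Adding “transfer” removes the river-bank page simply because that page lacks the word.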
12. Not only ad-hoc queries
• What if you do not know what to enter as a
search term (or do not want to)?
13. How does a search engine work?
Alternative search strategies:
• Browsing (Wikipedia, Yahoo! Directory)
• Social bookmarking (digg, del.icio.us)
• Recommender systems (stumbleupon, Amazon)
14. How does a search engine work?
• How can a search return documents from all
over the web in less than a quarter of a
second?
15. How does a search engine work?
Indexing (more later)
Multiple servers in parallel
Pre-selection based on time/origin/query
16. How does a search engine work?
• Does a search engine look up the results live
on the Web?
flickr/photophilde
17. How does a search engine work?
• Does a search engine maintain a copy of each
document you can search for?
18. How does a search engine work?
No, the engine uses a kind of locally stored
summary of each page.
Not all pages are included; duplicates and junk
are thrown away.
20. Crawling
• How does a search engine know your site exists?
Search engines follow the links on pages they already
know, so if someone else links to your site, the
engines will find you sooner or later.
This process is called “crawling”.
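Crawling can be sketched as a breadth-first walk over links. The example below is an illustration only: the link graph is a hypothetical in-memory dictionary standing in for real page fetches, and the site names are invented.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example", "d.example"],
    "d.example": [],
    "island.example": ["a.example"],  # links out, but nobody links to it
}

def crawl(seeds, links):
    """Visit every page reachable from the seed pages by following links."""
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # do not queue a page twice
                seen.add(target)
                queue.append(target)
    return order

found = crawl(["a.example"], LINKS)
print(found)
```

The page that no known page links to is never discovered, which is why an incoming link matters.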
23. How does a search engine work?
• Can you crawl the entire web?
• How big is the web anyway?
24. Hubs
Almost. The web has the nice property that there are
very few pages that link to many others and a lot of
pages that link to very few other pages.
25. Deep Web
In addition, there is the “Deep Web”: the part of the web that is not
linked to with a fixed URL (for example, data in a database).
Most of the “Deep Web” is not crawled at all.
26. How Big is the Web?
http://www.factshunt.com/2014/01/total-number-of-websites-size-of.html
759 Million - Total number of websites on the Web
510 Million - Total number of Live websites (active).
14.3 Trillion - Webpages, live on the Internet.
48 Billion - Webpages indexed by Google Inc.
14 Billion - Webpages indexed by Microsoft's Bing.
27. Third site on the Web
Nikhef, the Dutch National Institute for Subatomic Physics.
29. Preprocessing
1. Remove HTML tags
2. Tokenization (“I am walking.” -> [I, am, walking])
3. Remove stop words (the, I, it,…)
4. Stemming (cars, car -> car; walking, walks -> walk)
Result: for each doc, a list of terms
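The four steps can be sketched as below. This is a toy under stated assumptions: the stop-word list is a tiny illustrative one, and `stem` is a naive suffix stripper invented for this example, not a real stemming algorithm such as Porter's.

```python
import re
from html import unescape

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "i", "it", "am", "a", "an", "is", "and", "of", "to"}

def strip_tags(html):
    """Step 1: replace HTML tags with spaces and decode entities."""
    return unescape(re.sub(r"<[^>]+>", " ", html))

def tokenize(text):
    """Step 2: split into alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text)

def stem(word):
    """Step 4: toy stemmer that strips 'ing' or a final 's'."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(html):
    tokens = [t.lower() for t in tokenize(strip_tags(html))]
    # Step 3: drop stop words, then stem what is left.
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("<p>I am walking. It walks past the cars.</p>"))
```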
33. Term-document incidence
1 if play contains word, 0 otherwise
Sec. 1.1
• So we have a 0/1 vector for each term.
• To answer the query “Brutus AND Caesar BUT NOT Calpurnia”: take the vectors for
Brutus, Caesar and Calpurnia (complemented) and combine them with a bitwise AND.
• 110100 AND 110111 AND 101111 = 100100.
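The same computation with Python integers as bit vectors. The Calpurnia vector 010000 is inferred from its complement 101111 above; reading the leftmost bit as the first play is an assumption about the column order.

```python
N_PLAYS = 6  # one bit per play

brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000  # complement of 101111

# Python integers are unbounded, so mask the complement to 6 bits.
mask = (1 << N_PLAYS) - 1
not_calpurnia = ~calpurnia & mask

answer = brutus & caesar & not_calpurnia
bits = format(answer, "06b")
print(bits)

# Positions (1-based, left to right) of the plays that match.
matching = [i + 1 for i, bit in enumerate(bits) if bit == "1"]
print(matching)
```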
34. But? Bigger collections
• Consider 1 million documents, each with about 1000 words.
• Avg 6 bytes/word including spaces/punctuation
– 6GB of data in the documents.
• Say there are M = 500K distinct terms among these.
• 500K x 1M matrix has half-a-trillion 0’s and 1’s (500,000,000,000).
• But it has no more than one billion 1’s (1,000,000,000).
– matrix is extremely sparse: at most 1 entry in 500 is a 1.
• What’s a better representation?
– We only record the 1 positions.
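The arithmetic behind these numbers, as a quick check. The variable names are made up for this sketch; the figures are the ones given above.

```python
n_docs = 1_000_000
words_per_doc = 1_000
bytes_per_word = 6
n_terms = 500_000

collection_bytes = n_docs * words_per_doc * bytes_per_word  # size of the raw text
matrix_cells = n_terms * n_docs                             # cells in the 0/1 matrix
max_ones = n_docs * words_per_doc                           # at most one 1 per token
density = max_ones / matrix_cells                           # upper bound on the fraction of 1's

print(collection_bytes, matrix_cells, max_ones, density)
```

Since at most 0.2% of the cells can be 1, storing only the positions of the 1's saves almost all of the space.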
Sec. 1.1
36. Inverted index
• For each term t, we must store a list of all documents
that contain t.
– Identify each by a docID, a document serial number
Dictionary -> Postings (sorted by docID)
Brutus -> 1 2 4 11 31 45 173 174
Caesar -> 1 2 4 5 6 16 57 132
Calpurnia -> 2 31 54 101
Sec. 1.2
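A sketch of how such an index answers an AND query, using the postings for Brutus, Caesar and Calpurnia. The linear merge below is the usual textbook way to intersect two sorted lists; it is shown as an illustration, not as what any particular engine does.

```python
# Dictionary term -> postings list (docIDs, sorted ascending).
index = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

def intersect(p1, p2):
    """Intersect two sorted postings lists in one pass over both."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller docID
        else:
            j += 1
    return result

print(intersect(index["Brutus"], index["Caesar"]))
print(intersect(index["Brutus"], index["Calpurnia"]))
```

Because both lists are sorted by docID, the merge needs only one pass, which is why the postings are kept sorted.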
37. Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
-> Tokenizer
Token stream: Friends Romans Countrymen
-> Linguistic modules
Modified tokens: friend roman countryman
-> Indexer
Inverted index:
friend -> 2 4
roman -> 1 2
countryman -> 13 16
Sec. 1.2
38. Indexer steps: Token sequence
• Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Sec. 1.2
40. Indexer steps: Dictionary & Postings
• Multiple term entries in a single document are merged.
• Split into Dictionary and Postings
• Doc. frequency information is added.
Sec. 1.2
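The indexer steps can be sketched on the two example documents. The tokenizer here is a simple regular expression chosen for this sketch (an assumption); it lowercases tokens and does no stemming or stop-word removal.

```python
import re
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokens(text):
    """Lowercased word tokens; apostrophes are kept so that i' survives."""
    return [t.lower() for t in re.findall(r"[A-Za-z']+", text)]

# Step 1: sequence of (modified token, document ID) pairs.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokens(text)]

# Step 2: sort by term, then by docID.
pairs.sort()

# Step 3: merge repeated entries and split into dictionary and postings.
postings = defaultdict(list)
for term, doc_id in pairs:
    if not postings[term] or postings[term][-1] != doc_id:
        postings[term].append(doc_id)

# Step 4: document frequency = length of the postings list.
doc_freq = {term: len(doc_ids) for term, doc_ids in postings.items()}

print(postings["caesar"], doc_freq["caesar"])
print(postings["killed"], doc_freq["killed"])
```

“killed” occurs twice in Doc 1 but ends up with a single posting, which is the merging described above.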
41. Index size
• How big can your index be on a single
machine?
• Consider an uncompressed index of one year of
Reuters news messages: does that fit in main
memory?
• How big do the index and dictionary
become?
42. Reuters RCV1 statistics
statistic                                         value
documents                                         800,000
avg. # tokens per doc                             200
terms (= word types)                              400,000
avg. # bytes per token (without spaces/punct.)    4.5
avg. # bytes per term                             7.5
postings                                          100,000,000
Sec. 4.2
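A rough back-of-the-envelope estimate from these statistics. The 4 bytes per docID is an assumption made for this sketch, and the dictionary figure counts only the term strings, not pointers or other overhead.

```python
n_docs = 800_000
tokens_per_doc = 200
n_terms = 400_000
bytes_per_term = 7.5
n_postings = 100_000_000

BYTES_PER_DOC_ID = 4  # assumption: each docID stored as a 32-bit integer

total_tokens = n_docs * tokens_per_doc          # tokens in the whole collection
dictionary_bytes = n_terms * bytes_per_term     # term strings only
postings_bytes = n_postings * BYTES_PER_DOC_ID  # uncompressed postings

print(total_tokens)
print(dictionary_bytes / 1e6, "MB for the term strings")
print(postings_bytes / 1e6, "MB for the postings")
```

Under these assumptions the postings dominate: roughly 400 MB against about 3 MB of term strings.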
43. How well does a search engine work?
Measure it!
Select a representative set of queries
(e.g. from a server log).
Ask a representative set of human raters to
“judge” the relevance of all the search results.
Check whether one engine is better than the other by
counting whether it returns more relevant pages
and fewer non-relevant ones (the whole truth /
nothing but the truth).
For how many queries is this the case? Is this
more than you would expect by pure chance?
Google
Yahoo!
49. Precision at N
• When the number of results grows larger, the
precision over the entire set may matter less
than the precision over the first N results.
• Precision at N (P@N)
• P@1 = 1.0
• P@5 = 0.6
[Plot: P@N and R@N (vertical axis, 0 to 1) against N (horizontal axis, 0 to 30)]
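P@N and R@N can be computed from a ranked list of relevance judgements. The ranking below is hypothetical, chosen so that it reproduces P@1 = 1.0 and P@5 = 0.6; the total number of relevant documents is likewise an assumption.

```python
def precision_at(relevance, n):
    """Fraction of the first n results that are relevant."""
    return sum(relevance[:n]) / n

def recall_at(relevance, n, total_relevant):
    """Fraction of all relevant documents found in the first n results."""
    return sum(relevance[:n]) / total_relevant

# Hypothetical judgements for a ranked result list: 1 = relevant, 0 = not.
ranking = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
TOTAL_RELEVANT = 4  # assumption: every relevant document appears in the list

print(precision_at(ranking, 1))
print(precision_at(ranking, 5))
print(recall_at(ranking, 5, TOTAL_RELEVANT))
```

As N grows, recall can only rise or stay equal, while precision usually drops.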
50. Ranking
State-of-the-art search engines use all kinds of
tricks for ranking.
Let’s think of a few …
51. Example weighting scheme: tf.idf
Term Frequency
Inverse Document Frequency
Every word is assigned a weight for a document.
Some words are more important than others.
One version:
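As an assumption, since no formula is given in the text here, the sketch below uses a common variant: weight = tf × log(N / df), with tf the number of times the term occurs in the document, df the number of documents containing the term, and N the number of documents. The three documents are made up.

```python
import math
from collections import Counter

# Hypothetical pre-processed documents (lists of terms).
docs = {
    "d1": ["bank", "transfer", "bank"],
    "d2": ["river", "bank"],
    "d3": ["river", "walk"],
}

def tf_idf(term, doc_id, docs):
    """weight = tf * log(N / df), one common tf.idf variant."""
    tf = Counter(docs[doc_id])[term]
    df = sum(1 for words in docs.values() if term in words)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

print(tf_idf("bank", "d1", docs))
print(tf_idf("transfer", "d1", docs))
```

“transfer” occurs only once in d1 but in fewer documents overall, so in this variant it outweighs “bank”, which occurs twice.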
53. Why the “Log”?
• How often does the most common word
appear in a corpus? How often the second
most common? Etc.
– Split the books into words, cut them up on the
spaces and punctuation
– Delete all punctuation
– Sort all words
– Count the words
– Plot the counts
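The counting steps can be sketched as follows. The sample sentence is a made-up stand-in for the books, and the plot is replaced by printing rank, word and count.

```python
import re
from collections import Counter

# Hypothetical stand-in for the text of the books.
text = "The cat and the dog and the bird saw the cat."

# Split on spaces and punctuation, dropping the punctuation.
words = re.findall(r"[a-z']+", text.lower())

# Count the words and sort them by count, most common first.
counts = Counter(words)
ranked = counts.most_common()

for rank, (word, count) in enumerate(ranked, start=1):
    print(rank, word, count)
```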
54. Zipf’s law
The most frequent word will occur approximately twice as often as the second most frequent
word, three times as often as the third most frequent word, etc.
Formally: the frequency of a word is inversely proportional to its rank in the frequency table.
wugology.com
56. But! Heaps’ Law
• Split the books into words, cut them up on the
spaces and punctuation
• Delete all punctuation
• Do not sort words
• Go over all words and count the number of
unique words you have seen
• Plot the results linearly.
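A sketch of this count on a made-up sentence (the text is hypothetical): for every position in the text, record how many distinct words have been seen so far.

```python
import re

def vocabulary_growth(words):
    """Number of unique words seen after each word, in reading order."""
    seen = set()
    growth = []
    for word in words:
        seen.add(word)
        growth.append(len(seen))
    return growth

# Hypothetical stand-in for the text of the books.
text = "the cat saw the dog and the dog saw the cat"
words = re.findall(r"[a-z']+", text.lower())

growth = vocabulary_growth(words)
print(growth)
```

The curve rises quickly at the start and then flattens as more of the words are repeats.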
58. Heaps’ Law
Informally:
By scanning the text we will hit upon the most
common words rather quickly, but we will continue
to encounter new (infrequent) words, at an
ever slower rate.
59. Other Ranking tricks
• Localisation (language, but also your mobile
location)
• Personalisation
• Log analysis
• PageRank
60. PageRank (Page and Brin)
• Absolute score for a page
• Intuition: Pages that are linked to by
important pages are themselves important
i.e. the PageRank value for a page u depends on
the PageRank value of each page v in the set B_u
(the set of all pages linking to page u), divided
by the number L(v) of links from page v.
http://en.wikipedia.org/wiki/PageRank
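A minimal sketch of the iteration on a three-page graph. The graph is hypothetical, and the damping factor of 0.85 is an assumption taken from the usual formulation; it is not mentioned in the text above. Every page in the example has at least one outgoing link, so pages without links are not handled.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iterate PR(u) = (1 - d) / N + d * sum(PR(v) / L(v) for v in B_u)."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal scores
    for _ in range(iterations):
        new_rank = {}
        for u in pages:
            # B_u: pages v that link to u; L(v): number of links from v.
            incoming = sum(rank[v] / len(links[v]) for v in pages if u in links[v])
            new_rank[u] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

# Hypothetical link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items()):
    print(page, round(score, 3))
```

C ends up highest because both other pages link to it; A is second because its only incoming link comes from the important page C.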
61. So..
• Web search is a form of information retrieval with
the Web as corpus
• Inverted indexes are built using crawling,
processing and indexing
• A boolean query is then matched to the index,
returning pages that match
• How well a search engine works depends on user
judgement
– Precision, Recall and F-measure
• Ranking is key – especially in Web search
– There are many strategies for ranking, and being good
in ranking can make you very rich
62. Oh, and optimizing for Google’s ranking can
make you a bit rich, and a bit cool
https://www.youtube.com/watch?v=fnSJBpB_OKQ
Editor's Notes
Essentially, a user, driven by an information need, constructs a query in some query language. The
query is submitted to a system that selects from a collection of documents (corpus), those documents
that match the query as indicated by certain matching rules. A query refinement process might be
used to create new queries and/or to refine the results. (Figure 1)
First website: http://info.cern.ch/hypertext/WWW/TheProject.html
Nikhef, the Dutch National Institute for Subatomic Physics. Third!
6%
TODO: n on the right-hand graph
How many people click through