Search on the Web
Victor de Boer
Web Technology 2015
Slides adapted from
Willem Robert van Hage
Overview
• Search engines:
– What do they do
– How do they work?
– How good are they? How to evaluate?
• Discover information laws by counting words
How does a search engine work?
• What is a search engine?
Classic Information Retrieval model
“Bank in Amsterdam”
How does a search engine work?
• How does a search engine know a document
matches your question?
• Words can have several meanings. Does the
search engine know which one you need?
How does a search engine work?
In fact, most search engines do not know what you
mean; they just make a guess. If they read “bank”,
they do not know whether you mean a river bank or a
financial institution.
They usually return the pages that make the majority
of users happy.
How does a search engine work?
• So if you enter “bank”, the search engine does
not necessarily know what you mean.
• But what if you enter “bank transfer”?
How does a search engine work?
Then the search engine still does not “know” what
you mean, but will just return pages that mention
both “bank” and “transfer”. If these correspond
with what you meant, is that a “mere coincidence”?
Not entirely, because the word “transfer” in
combination with “bank” makes the query more
informative than either word on its own.
Boolean search, ad-hoc query
Not only ad-hoc queries
• What if you do not know what to enter as a
search term (or do not want to)?
How does a search engine work?
Alternative search strategies:
• Browsing (Wikipedia, Yahoo! Directory)
• Social bookmarking (digg, del.icio.us)
• Recommender systems (stumbleupon, Amazon)
How does a search engine work?
• How can a search return documents from all
over the web in less than a quarter of a
second?
How does a search engine work?
Indexing (more later)
Multiple servers in parallel
Pre-selection based on time/origin/query
How does a search engine work?
• Does a search engine look up the results live
on the Web?
flickr/photophilde
How does a search engine work?
• Does a search engine maintain a copy of each
document you can search for?
How does a search engine work?
No, the engine uses a kind of locally stored
summary of each page.
Not all pages are included: duplicates and junk
are thrown away.
CRAWLING
PREPROCESSING
BUILDING INDEX
Crawling
• How does a search engine know your site exists?
Search engines follow links on pages they already
know, so if someone else links to your site, the
engines will find you sooner or later.
This process is called “crawling”
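A minimal sketch of link-following, assuming a tiny in-memory link graph instead of real HTTP requests. The `LINKS` dictionary and all page names are made up for illustration. It shows why a page is found only if some already-known page links to it.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "known.example/home": ["known.example/about", "yoursite.example/"],
    "known.example/about": ["known.example/home"],
    "yoursite.example/": ["yoursite.example/contact"],
    "yoursite.example/contact": [],
    "island.example/": ["island.example/page2"],  # no known page links here
    "island.example/page2": [],
}

def crawl(seeds):
    """Breadth-first traversal: visit every page reachable from the seeds."""
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in LINKS.get(page, []):
            if target not in seen:      # never queue a page twice
                seen.add(target)
                queue.append(target)
    return order

found = crawl(["known.example/home"])
print(found)
```

In this toy graph the pages under `island.example` are never reached, because nothing the crawler already knows links to them.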
Robots.txt
www.s1z.ru
How does a search engine work?
• Can you crawl
the entire web?
• How big is the
web anyway?
Hubs
Almost. The web has the nice property that there are
very few pages that link to many others and a lot of
pages that link to very few other pages.
Deep Web
In addition, there is the “Deep Web”, the part
of the web that isn’t linked to with a fixed URL
(for example, data in a database).
Most of the “Deep Web” is not crawled at all.
How Big is the Web?
http://www.factshunt.com/2014/01/total-number-of-websites-size-of.html
759 Million - Total number of websites on the Web
510 Million - Total number of Live websites (active).
14.3 Trillion - Webpages, live on the Internet.
48 Billion - Webpages indexed by Google Inc.
14 Billion - Webpages indexed by Microsoft's Bing.
Third site on the Web
Nikhef, the Dutch institute for subatomic physics
(Nederlands instituut voor subatomaire fysica).
Back to building the index
Preprocessing
1. Remove HTML tags
2. Tokenization (“I am walking.” -> [I, am, walking])
3. Remove stop words (the, I, it,…)
4. Stemming (cars, car -> car; walking, walks -> walk)
Result: for each doc, a list of terms
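The four steps above can be sketched as follows. This is a toy pipeline under stated assumptions: the stop-word list is a tiny made-up sample, tokens are lowercased (the slide keeps the original case), and `stem` is a crude suffix stripper rather than a real stemming algorithm such as Porter's.

```python
import re

STOP_WORDS = {"the", "i", "it", "am", "a", "is"}  # tiny illustrative list

def strip_tags(html):
    """Step 1: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", html)

def tokenize(text):
    """Step 2: lowercase, then keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Step 4: crude suffix stripping, only good enough for the slide's examples."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(html):
    """Run steps 1 to 4 and return the list of terms for one document."""
    tokens = tokenize(strip_tags(html))
    return [stem(t) for t in tokens if t not in STOP_WORDS]  # step 3 is the filter

print(preprocess("<p>I am walking. The cars and the car.</p>"))
# ['walk', 'car', 'and', 'car']
```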
Term-document matrices
Shakespeare
Term-document incidence
1 if play contains word, 0 otherwise
Sec. 1.1
Query: Brutus AND Caesar BUT NOT Calpurnia
• So we have a 0/1 vector for each term.
• To answer the query: take the vectors for Brutus, Caesar and Calpurnia
(complemented), then bitwise AND them.
• 110100 AND 110111 AND 101111 = 100100.
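The same query can be checked with Python integers used as bit vectors. This is only a sketch: the Calpurnia vector 010000 is inferred from its complement 101111 given on the slide, and the variable names are illustrative.

```python
# Incidence vectors, one bit per play (leftmost bit = first play).
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000               # inferred: its complement is 101111

N_DOCS = 6
mask = (1 << N_DOCS) - 1           # keep only the six document bits

not_calpurnia = ~calpurnia & mask  # complement within six bits: 101111
answer = brutus & caesar & not_calpurnia

print(format(answer, "06b"))       # 100100
```

The two 1 bits in the answer mark the first and fourth plays as matches.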
But? Bigger collections
• Consider 1 million documents, each with about 1000 words.
• Avg 6 bytes/word including spaces/punctuation
– 6GB of data in the documents.
• Say there are M = 500K distinct terms among these.
• A 500K x 1M matrix has half a trillion (500,000,000,000) 0’s and 1’s.
• But it has no more than one billion (1,000,000,000) 1’s.
– The matrix is extremely sparse: at most 1 cell in 500 holds a 1.
• What’s a better representation?
– We only record the 1 positions.
Sec. 1.1
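A quick check of the arithmetic for this collection, using only the numbers stated on the slide (the variable names are illustrative):

```python
n_docs = 1_000_000
words_per_doc = 1_000
bytes_per_word = 6
n_terms = 500_000

collection_bytes = n_docs * words_per_doc * bytes_per_word
cells = n_terms * n_docs            # size of the full 0/1 matrix
max_ones = n_docs * words_per_doc   # each word occurrence sets at most one cell

print(collection_bytes)   # 6000000000, i.e. 6 GB
print(cells)              # 500000000000
print(max_ones)           # 1000000000
print(cells // max_ones)  # 500: at most one cell in 500 is a 1
```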
Inverted indices
Inverted index
• For each term t, we must store a list of all documents
that contain t.
– Identify each by a docID, a document serial number
Dictionary    Postings (sorted by docID)
Brutus     -> 1 2 4 11 31 45 173 174
Caesar     -> 1 2 4 5 6 16 57 132
Calpurnia  -> 2 31 54 101
Sec. 1.2
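Because postings are sorted by docID, an AND query can be answered by walking two lists in step. A minimal sketch, using the three postings lists shown on this slide; the function name `intersect` is an illustrative choice.

```python
def intersect(p1, p2):
    """Merge-style intersection of two postings lists sorted by docID."""
    i, j, result = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])   # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller docID
        else:
            j += 1
    return result

index = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

print(intersect(index["Brutus"], index["Calpurnia"]))  # [2, 31]
print(intersect(index["Brutus"], index["Caesar"]))     # [1, 2, 4]
```

Each list is read once, so the work grows with the combined length of the two lists rather than with the size of the collection.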
Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
  -> Tokenizer
Token stream: Friends Romans Countrymen
  -> Linguistic modules
Modified tokens: friend roman countryman
  -> Indexer
Inverted index:
  friend      -> 2 4
  roman       -> 1 2
  countryman  -> 13 16
Sec. 1.2
Indexer steps: Token sequence
• Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Sec. 1.2
Indexer steps: Sort
• Sort by terms
– And then docID
Core indexing step
Sec. 1.2
Indexer steps: Dictionary & Postings
• Multiple term entries
in a single document
are merged.
• Split into Dictionary
and Postings
• Doc. frequency
information is added.
Sec. 1.2
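The three indexer steps can be sketched on the two example documents. This is an illustration under simplifying assumptions: lowercasing is the only normalization applied (no stemming or stop-word removal), and the names `pairs`, `dictionary` and `postings` are chosen for this example.

```python
import re
from itertools import groupby

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: sequence of (term, docID) pairs.
pairs = [(token, doc_id)
         for doc_id, text in docs.items()
         for token in re.findall(r"[a-z']+", text.lower())]

# Step 2: sort by term, then by docID.
pairs.sort()

# Step 3: merge duplicate entries, then split into a dictionary
# (term -> document frequency) and postings (term -> sorted docIDs).
dictionary, postings = {}, {}
for term, group in groupby(pairs, key=lambda pair: pair[0]):
    doc_ids = sorted({doc_id for _, doc_id in group})
    postings[term] = doc_ids
    dictionary[term] = len(doc_ids)

print(postings["caesar"], dictionary["caesar"])  # [1, 2] 2
print(postings["killed"], dictionary["killed"])  # [1] 1
```

"killed" occurs twice in Doc 1 but yields a single posting, which is the merge described in the first bullet.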
Index size
• How big can your index be on a single
machine?
• Consider an uncompressed index of one year of
Reuters news messages: does that fit in main
memory?
• How big do the index and the dictionary
become?
Reuters RCV1 statistics
statistic                                        value
documents                                        800,000
avg. # tokens per doc                            200
terms (= word types)                             400,000
avg. # bytes per token (without spaces/punct.)   4.5
avg. # bytes per term                            7.5
postings                                         100,000,000
Sec. 4.2
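A rough back-of-the-envelope estimate from these statistics. The 4 bytes per posting is an assumption (one uncompressed 32-bit docID), not a figure from the slides, and the estimate ignores pointers, frequencies and any other per-entry overhead.

```python
terms = 400_000
bytes_per_term = 7.5
n_postings = 100_000_000
BYTES_PER_DOCID = 4   # assumption: one 32-bit integer per posting

dictionary_bytes = terms * bytes_per_term        # term strings only
postings_bytes = n_postings * BYTES_PER_DOCID    # docIDs only

print(dictionary_bytes / 1e6, "MB of term strings")  # 3.0
print(postings_bytes / 1e6, "MB of postings")        # 400.0
```

Under these assumptions the postings, not the dictionary, dominate the size of the index.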
How well does a search engine work?
Measure it!
Select a representative set of queries
(e.g. from a server log).
Ask a representative set of human raters to
“judge” the relevance of all the search results.
Check whether one engine is better than the other by
counting whether it returns more relevant pages
and fewer non-relevant ones (the whole truth /
nothing but the truth).
For how many queries is this the case? Is this
more than you would expect by pure chance?
Google
Yahoo!
How does a search engine work?
Tradeoff
better system
F-measure is the harmonic mean of precision and recall:
F = 2 · precision · recall / (precision + recall)
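A small sketch of how the three measures can be computed for one query, given the set of retrieved documents and the set judged relevant. The document IDs are invented for the example.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-measure) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                 # relevant AND retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f

p, r, f = precision_recall_f(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5", "d6", "d7"],
)
print(p, r, round(f, 3))  # 0.5 0.4 0.444
```

Two of the four retrieved documents are relevant (precision 0.5) and two of the five relevant documents were found (recall 0.4).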
Google eye-tracking
agent-seo.com Cornell University Eye-Tracking Study Data
Clicks
agent-seo.com Cornell University Eye-Tracking Study Data
Next page?
Precision at N
• When the number of results grows larger, the
precision over the entire result set may matter
less than the precision over the first N results.
• Precision at N (P@N)
• P@1 = 1.0
• P@5 = 0.6
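Precision at N only looks at the top of the ranking. In the sketch below the relevance judgments are hypothetical, chosen so that they reproduce the two values above.

```python
def precision_at(n, judgments):
    """judgments: relevance of the ranked results, best-ranked first (True = relevant)."""
    top = judgments[:n]
    return sum(top) / n

# Hypothetical ranking of eight results.
ranking = [True, False, True, True, False, False, True, False]

print(precision_at(1, ranking))  # 1.0
print(precision_at(5, ranking))  # 0.6
```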
[Figure: P@N and R@N plotted against N, for N from 0 to 30; vertical axis from 0 to 1]
Ranking
State-of-the-art search
engines use all kinds of
tricks for ranking.
Let’s think of a few …
Example weighting scheme: tf.idf
Term Frequency
Inverse Document Frequency
Every word is assigned a weight for a document.
Some words are more important than others.
One version:
TFIDF Example
Doc1                    Doc2
Term    Term Count      Term     Term Count
this    1               this     1
is      1               is       1
a       2               another  2
sample  1               example  3
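A sketch of one common tf.idf variant applied to these two documents. The exact variant is an assumption, since the formula on the slide did not survive: here tf is the term count divided by the document length, and idf is log10 of (number of documents / number of documents containing the term).

```python
import math

docs = {
    "Doc1": {"this": 1, "is": 1, "a": 2, "sample": 1},
    "Doc2": {"this": 1, "is": 1, "another": 2, "example": 3},
}

def tf(term, doc):
    """Term frequency: count of the term divided by the document length."""
    counts = docs[doc]
    return counts.get(term, 0) / sum(counts.values())

def idf(term):
    """Inverse document frequency; the term must occur in at least one document."""
    df = sum(1 for counts in docs.values() if term in counts)
    return math.log10(len(docs) / df)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

print(tfidf("this", "Doc1"))                # 0.0
print(round(tfidf("example", "Doc2"), 3))   # 0.129
```

"this" occurs in every document, so its idf is log10(1) = 0 and it gets no weight. "example" occurs only in Doc2 and is frequent there, so it gets the highest weight.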
Why the “Log”
• How often does the most common word
appear in a corpus? How often the second
most common? Etc.
– Split the books into words, cut them up on the
spaces and punctuation
– Delete all punctuation
– Sort all words
– Count the words
– Plot the counts
Zipf’s law
The most frequent word will occur approximately twice as often as the second most frequent
word, three times as often as the third most frequent word, etc.
Formally: the frequency of a word is inversely proportional to its rank in the frequency table.
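The counting procedure can be sketched with `collections.Counter`. The sample sentence is made up and far too short to show the law; on a real corpus, the product rank × count should stay roughly constant if Zipf's law holds.

```python
import re
from collections import Counter

def rank_frequency(text):
    """Split into words, count them, and sort by descending count."""
    words = re.findall(r"[a-z]+", text.lower())  # drops spaces and punctuation
    return Counter(words).most_common()

sample = "the cat and the dog and the bird saw the cat"
for rank, (word, count) in enumerate(rank_frequency(sample), start=1):
    print(rank, word, count, rank * count)
```

Plotting count against rank on logarithmic axes would give an approximately straight line for a corpus that follows the law.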
wugology.com
Zipf’s law
On Logarithmic paper
But! Heaps’ Law
• Split the books into words, cut them up on the
spaces and punctuation
• Delete all punctuation
• Do not sort words
• Go over all words and count the number of
unique words you have seen
• Plot the results linearly.
Heaps’ law
• How fast does the dictionary grow?
Heaps’ law
Informally:
By scanning the text we will hit upon the most
common words rather quickly, but we will continue
to encounter new (infrequent) words, at an
increasingly slow rate.
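The vocabulary growth can be measured by keeping a set of words seen so far. A minimal sketch with a made-up sample sentence; on a real corpus the curve keeps rising but flattens as the text gets longer.

```python
import re

def vocabulary_growth(text):
    """For each word position, the number of distinct words seen so far."""
    seen = set()
    growth = []
    for word in re.findall(r"[a-z]+", text.lower()):
        seen.add(word)
        growth.append(len(seen))
    return growth

sample = "the cat and the dog and the bird saw the cat"
print(vocabulary_growth(sample))
# [1, 2, 3, 3, 4, 4, 4, 5, 6, 6, 6]
```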
Other Ranking tricks
• Localisation (language, but also your mobile
location)
• Personalisation
• Log analysis
• PageRank
PageRank (Page and Brin)
• Absolute score for a page
• Intuition: Pages that are linked to by
important pages are themselves important
i.e. the PageRank value for a page u depends on
the PageRank value of each page v in the set Bu
(the set of all pages linking to page u), divided
by the number L(v) of links from page v:
PR(u) = sum over v in Bu of PR(v) / L(v)
Source: http://en.wikipedia.org/wiki/PageRank
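A sketch of an iterative PageRank computation on a three-page toy graph. Two things here go beyond the simplified description above and should be read as assumptions: the damping factor (0.85 is the commonly quoted value) and the requirement that every page has at least one outgoing link. The graph itself is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to (every page needs at least one link)."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}   # start from a uniform score
    for _ in range(iterations):
        new_rank = {}
        for u in pages:
            # Sum PR(v) / L(v) over all pages v that link to u.
            incoming = sum(rank[v] / len(links[v]) for v in pages if u in links[v])
            new_rank[u] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

C ends up with the highest score because both A and B link to it, and A comes second because it is linked to by the important page C.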
So..
• Web search is a form of information retrieval with
the Web as corpus
• Inverted indexes are built using crawling,
processing and indexing
• A Boolean query is then matched to the index,
returning pages that match
• How well a search engine works depends on user
judgement
– Precision, Recall and F-measure
• Ranking is key – especially in Web search
– There are many strategies for ranking, and being good
in ranking can make you very rich
Oh, and optimizing for Google’s ranking can
make you a bit rich, and a bit cool
https://www.youtube.com/watch?v=fnSJBpB_OKQ

Mais conteúdo relacionado

Mais procurados

Internet Research: Finding Websites, Blogs, Wikis, and More
Internet Research: Finding Websites, Blogs, Wikis, and MoreInternet Research: Finding Websites, Blogs, Wikis, and More
Internet Research: Finding Websites, Blogs, Wikis, and Moreeclark131
 
searching & copyright
searching & copyrightsearching & copyright
searching & copyrightLUZ PINGOL
 
Semantic Web and Schema.org
Semantic Web and Schema.orgSemantic Web and Schema.org
Semantic Web and Schema.orgrvguha
 
Related Entity Finding on the Web
Related Entity Finding on the WebRelated Entity Finding on the Web
Related Entity Finding on the WebPeter Mika
 
Microformats I: What & Why
Microformats I: What & WhyMicroformats I: What & Why
Microformats I: What & WhyRachael L Moore
 
Understanding Queries through Entities
Understanding Queries through EntitiesUnderstanding Queries through Entities
Understanding Queries through EntitiesPeter Mika
 
萌典與零時政府
萌典與零時政府萌典與零時政府
萌典與零時政府Au Tang
 
Digifoot 2012 ppt
Digifoot 2012 pptDigifoot 2012 ppt
Digifoot 2012 ppttpoelzer
 
Creating Web APIs with JSON-LD and RDF
Creating Web APIs with JSON-LD and RDFCreating Web APIs with JSON-LD and RDF
Creating Web APIs with JSON-LD and RDFdonaldlsmithjr
 
The hunt for the perfect interface in a googlified world
The hunt for the perfect interface in a googlified worldThe hunt for the perfect interface in a googlified world
The hunt for the perfect interface in a googlified worldnabot
 
Big Data Analytics course: Named Entities and Deep Learning for NLP
Big Data Analytics course: Named Entities and Deep Learning for NLPBig Data Analytics course: Named Entities and Deep Learning for NLP
Big Data Analytics course: Named Entities and Deep Learning for NLPChristian Morbidoni
 
Open Library at Make Books Apparent
Open Library at Make Books ApparentOpen Library at Make Books Apparent
Open Library at Make Books ApparentGeorge Oates
 
Searching the Internet
Searching the Internet Searching the Internet
Searching the Internet guest32ae6
 
The Simple Power of the Link
The Simple Power of the LinkThe Simple Power of the Link
The Simple Power of the LinkRichard Wallis
 
Week 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and ProducingWeek 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and Producingkurtgessler
 

Mais procurados (20)

Internet Research: Finding Websites, Blogs, Wikis, and More
Internet Research: Finding Websites, Blogs, Wikis, and MoreInternet Research: Finding Websites, Blogs, Wikis, and More
Internet Research: Finding Websites, Blogs, Wikis, and More
 
Metadata
MetadataMetadata
Metadata
 
searching & copyright
searching & copyrightsearching & copyright
searching & copyright
 
Searching tech2
Searching tech2Searching tech2
Searching tech2
 
Ted Talk
Ted TalkTed Talk
Ted Talk
 
Research 101
Research 101Research 101
Research 101
 
Semantic Web and Schema.org
Semantic Web and Schema.orgSemantic Web and Schema.org
Semantic Web and Schema.org
 
Related Entity Finding on the Web
Related Entity Finding on the WebRelated Entity Finding on the Web
Related Entity Finding on the Web
 
Microformats I: What & Why
Microformats I: What & WhyMicroformats I: What & Why
Microformats I: What & Why
 
Understanding Queries through Entities
Understanding Queries through EntitiesUnderstanding Queries through Entities
Understanding Queries through Entities
 
萌典與零時政府
萌典與零時政府萌典與零時政府
萌典與零時政府
 
Digifoot 2012 ppt
Digifoot 2012 pptDigifoot 2012 ppt
Digifoot 2012 ppt
 
Creating Web APIs with JSON-LD and RDF
Creating Web APIs with JSON-LD and RDFCreating Web APIs with JSON-LD and RDF
Creating Web APIs with JSON-LD and RDF
 
The hunt for the perfect interface in a googlified world
The hunt for the perfect interface in a googlified worldThe hunt for the perfect interface in a googlified world
The hunt for the perfect interface in a googlified world
 
Big Data Analytics course: Named Entities and Deep Learning for NLP
Big Data Analytics course: Named Entities and Deep Learning for NLPBig Data Analytics course: Named Entities and Deep Learning for NLP
Big Data Analytics course: Named Entities and Deep Learning for NLP
 
Open Library at Make Books Apparent
Open Library at Make Books ApparentOpen Library at Make Books Apparent
Open Library at Make Books Apparent
 
Searching the Internet
Searching the Internet Searching the Internet
Searching the Internet
 
The Simple Power of the Link
The Simple Power of the LinkThe Simple Power of the Link
The Simple Power of the Link
 
Week 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and ProducingWeek 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and Producing
 
howtoresearch colai
howtoresearch colaihowtoresearch colai
howtoresearch colai
 

Destaque

Information seeking
Information seekingInformation seeking
Information seekingJohan Koren
 
Practical Elasticsearch - real world use cases
Practical Elasticsearch - real world use casesPractical Elasticsearch - real world use cases
Practical Elasticsearch - real world use casesItamar
 
An introduction to inverted index
An introduction to inverted indexAn introduction to inverted index
An introduction to inverted indexweedge
 
Information searching & retrieving techniques khalid
Information searching & retrieving techniques khalidInformation searching & retrieving techniques khalid
Information searching & retrieving techniques khalidKhalid Mahmood
 
Elasticsearch Distributed search & analytics on BigData made easy
Elasticsearch Distributed search & analytics on BigData made easyElasticsearch Distributed search & analytics on BigData made easy
Elasticsearch Distributed search & analytics on BigData made easyItamar
 
What should be done to IR algorithms to meet current, and possible future, ha...
What should be done to IR algorithms to meet current, and possible future, ha...What should be done to IR algorithms to meet current, and possible future, ha...
What should be done to IR algorithms to meet current, and possible future, ha...Simon Lia-Jonassen
 

Destaque (10)

NLP new words
NLP new wordsNLP new words
NLP new words
 
Intro to NLP. Lecture 2
Intro to NLP.  Lecture 2Intro to NLP.  Lecture 2
Intro to NLP. Lecture 2
 
Text Indexing / Inverted Indices
Text Indexing / Inverted IndicesText Indexing / Inverted Indices
Text Indexing / Inverted Indices
 
Information seeking
Information seekingInformation seeking
Information seeking
 
Inverted index
Inverted indexInverted index
Inverted index
 
Practical Elasticsearch - real world use cases
Practical Elasticsearch - real world use casesPractical Elasticsearch - real world use cases
Practical Elasticsearch - real world use cases
 
An introduction to inverted index
An introduction to inverted indexAn introduction to inverted index
An introduction to inverted index
 
Information searching & retrieving techniques khalid
Information searching & retrieving techniques khalidInformation searching & retrieving techniques khalid
Information searching & retrieving techniques khalid
 
Elasticsearch Distributed search & analytics on BigData made easy
Elasticsearch Distributed search & analytics on BigData made easyElasticsearch Distributed search & analytics on BigData made easy
Elasticsearch Distributed search & analytics on BigData made easy
 
What should be done to IR algorithms to meet current, and possible future, ha...
What should be done to IR algorithms to meet current, and possible future, ha...What should be done to IR algorithms to meet current, and possible future, ha...
What should be done to IR algorithms to meet current, and possible future, ha...
 

Semelhante a Web technology: Web search

ICT in Learning Process by A.Alekper
ICT in Learning Process by A.AlekperICT in Learning Process by A.Alekper
ICT in Learning Process by A.AlekperAlekper Alekperov
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalCarsten Eickhoff
 
Information Discovery and Search Strategies for Evidence-Based Research
Information Discovery and Search Strategies for Evidence-Based ResearchInformation Discovery and Search Strategies for Evidence-Based Research
Information Discovery and Search Strategies for Evidence-Based ResearchDavid Nzoputa Ofili
 
Usable Language | How Content Shapes The User Experience
Usable Language | How Content Shapes The User ExperienceUsable Language | How Content Shapes The User Experience
Usable Language | How Content Shapes The User ExperienceRandall Snare
 
Search engines by Gulshan K Maheshwari(QAU)
Search engines by Gulshan  K Maheshwari(QAU)Search engines by Gulshan  K Maheshwari(QAU)
Search engines by Gulshan K Maheshwari(QAU)GulshanKumar368
 
Using Search Analytics to Diagnose What’s Ailing your Information Architecture
Using Search Analytics to Diagnose What’s Ailing your Information ArchitectureUsing Search Analytics to Diagnose What’s Ailing your Information Architecture
Using Search Analytics to Diagnose What’s Ailing your Information ArchitectureLouis Rosenfeld
 
Searchland: Search quality for Beginners
Searchland: Search quality for BeginnersSearchland: Search quality for Beginners
Searchland: Search quality for BeginnersValeria de Paiva
 
Ordering the chaos: Creating websites with imperfect data
Ordering the chaos: Creating websites with imperfect dataOrdering the chaos: Creating websites with imperfect data
Ordering the chaos: Creating websites with imperfect dataAndy Stretton
 
Search Analytics for Fun and Profit
Search Analytics for Fun and ProfitSearch Analytics for Fun and Profit
Search Analytics for Fun and ProfitLouis Rosenfeld
 
Evaluating search engines
Evaluating search enginesEvaluating search engines
Evaluating search enginesPhil Bradley
 
Charting Searchland, ACM SIG Data Mining
Charting Searchland, ACM SIG Data MiningCharting Searchland, ACM SIG Data Mining
Charting Searchland, ACM SIG Data MiningValeria de Paiva
 
05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptxGambari Amosa Isiaka
 
Adaptable Information Workshop slides
Adaptable Information Workshop slidesAdaptable Information Workshop slides
Adaptable Information Workshop slidesLouis Rosenfeld
 
Academic Skills 4
Academic Skills 4Academic Skills 4
Academic Skills 4Hala Nur
 

Semelhante a Web technology: Web search (20)

ICT in Learning Process by A.Alekper
ICT in Learning Process by A.AlekperICT in Learning Process by A.Alekper
ICT in Learning Process by A.Alekper
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Information Discovery and Search Strategies for Evidence-Based Research
Information Discovery and Search Strategies for Evidence-Based ResearchInformation Discovery and Search Strategies for Evidence-Based Research
Information Discovery and Search Strategies for Evidence-Based Research
 
Usable Language | How Content Shapes The User Experience
Usable Language | How Content Shapes The User ExperienceUsable Language | How Content Shapes The User Experience
Usable Language | How Content Shapes The User Experience
 
Web Search Engine
Web Search EngineWeb Search Engine
Web Search Engine
 
Search engines by Gulshan K Maheshwari(QAU)
Search engines by Gulshan  K Maheshwari(QAU)Search engines by Gulshan  K Maheshwari(QAU)
Search engines by Gulshan K Maheshwari(QAU)
 
Using Search Analytics to Diagnose What’s Ailing your Information Architecture
Using Search Analytics to Diagnose What’s Ailing your Information ArchitectureUsing Search Analytics to Diagnose What’s Ailing your Information Architecture
Using Search Analytics to Diagnose What’s Ailing your Information Architecture
 
Searchland: Search quality for Beginners
Searchland: Search quality for BeginnersSearchland: Search quality for Beginners
Searchland: Search quality for Beginners
 
DC presentation 1
DC presentation 1DC presentation 1
DC presentation 1
 
Ordering the chaos: Creating websites with imperfect data
Ordering the chaos: Creating websites with imperfect dataOrdering the chaos: Creating websites with imperfect data
Ordering the chaos: Creating websites with imperfect data
 
Search Analytics for Fun and Profit
Search Analytics for Fun and ProfitSearch Analytics for Fun and Profit
Search Analytics for Fun and Profit
 
Evaluating search engines
Evaluating search enginesEvaluating search engines
Evaluating search engines
 
Search Engine
Search EngineSearch Engine
Search Engine
 
Charting Searchland, ACM SIG Data Mining
Charting Searchland, ACM SIG Data MiningCharting Searchland, ACM SIG Data Mining
Charting Searchland, ACM SIG Data Mining
 
Searchland2
Searchland2Searchland2
Searchland2
 
Starting a search application
Starting a search applicationStarting a search application
Starting a search application
 
05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx
 
Search engines
Search enginesSearch engines
Search engines
 
Adaptable Information Workshop slides
Adaptable Information Workshop slidesAdaptable Information Workshop slides
Adaptable Information Workshop slides
 
Academic Skills 4
Academic Skills 4Academic Skills 4
Academic Skills 4
 

Mais de Victor de Boer

One day workshop Linked Data and Semantic Web
One day workshop Linked Data and Semantic WebOne day workshop Linked Data and Semantic Web
One day workshop Linked Data and Semantic WebVictor de Boer
 
Linked Data for Digital Humanities research at Media Archives
Linked Data for Digital Humanities research at Media ArchivesLinked Data for Digital Humanities research at Media Archives
Linked Data for Digital Humanities research at Media ArchivesVictor de Boer
 
The Benefits of Linking Metadata for Internal and External users of an Audiov...
The Benefits of Linking Metadata for Internal and External users of an Audiov...The Benefits of Linking Metadata for Internal and External users of an Audiov...
The Benefits of Linking Metadata for Internal and External users of an Audiov...Victor de Boer
 
UX Challenges of Information Organisation: Assessment of Language Impairment ...
UX Challenges of Information Organisation: Assessment of Language Impairment ...UX Challenges of Information Organisation: Assessment of Language Impairment ...
UX Challenges of Information Organisation: Assessment of Language Impairment ...Victor de Boer
 
Interactive Dance Choreography Assistance presentation for ACE entertainment ...
Interactive Dance Choreography Assistance presentation for ACE entertainment ...Interactive Dance Choreography Assistance presentation for ACE entertainment ...
Interactive Dance Choreography Assistance presentation for ACE entertainment ...Victor de Boer
 
Fahad Ali's slides for Machine to-machine communication in rural conditions ...
Fahad Ali's slides for Machine to-machine communication in rural conditions  ...Fahad Ali's slides for Machine to-machine communication in rural conditions  ...
Fahad Ali's slides for Machine to-machine communication in rural conditions ...Victor de Boer
 
Linking African Traditional Medicine Knowledge - by Gossa Lo
Linking African Traditional Medicine Knowledge - by Gossa LoLinking African Traditional Medicine Knowledge - by Gossa Lo
Linking African Traditional Medicine Knowledge - by Gossa LoVictor de Boer
 
Enriching Media Collections for Event-based Exploration
Enriching Media Collections for Event-based ExplorationEnriching Media Collections for Event-based Exploration
Enriching Media Collections for Event-based ExplorationVictor de Boer
 
New Life for Old Media (NEM presentation)
New Life for Old Media  (NEM presentation)New Life for Old Media  (NEM presentation)
New Life for Old Media (NEM presentation)Victor de Boer
 
User-centered Data Science for Digital Humanities
User-centered Data Science for Digital HumanitiesUser-centered Data Science for Digital Humanities
User-centered Data Science for Digital HumanitiesVictor de Boer
 
Linked Data for Audiovisual Archives (Guest lecture at NISV)
Linked Data for Audiovisual Archives (Guest lecture at NISV)Linked Data for Audiovisual Archives (Guest lecture at NISV)
Linked Data for Audiovisual Archives (Guest lecture at NISV)Victor de Boer
 
Semantic Technology for Development: Semantic Web without the Web?
Semantic Technology for Development: Semantic Web without the Web?Semantic Technology for Development: Semantic Web without the Web?
Semantic Technology for Development: Semantic Web without the Web?Victor de Boer
 
DIVE+ and Events at EVENTS2017
DIVE+ and Events at EVENTS2017DIVE+ and Events at EVENTS2017
DIVE+ and Events at EVENTS2017Victor de Boer
 
Intro to Linked, Dutch Ships and Sailors and SPARQL handson
Intro to Linked, Dutch Ships and Sailors and SPARQL handson Intro to Linked, Dutch Ships and Sailors and SPARQL handson
Intro to Linked, Dutch Ships and Sailors and SPARQL handson Victor de Boer
 
Kasadaka and ICT4D at VU
Kasadaka and ICT4D at VUKasadaka and ICT4D at VU
Kasadaka and ICT4D at VUVictor de Boer
 
VU ICT4D symposium 2017 Francis Dittoh Mr. Meteo
VU ICT4D symposium 2017 Francis Dittoh  Mr. MeteoVU ICT4D symposium 2017 Francis Dittoh  Mr. Meteo
VU ICT4D symposium 2017 Francis Dittoh Mr. MeteoVictor de Boer
 
VU ICT4D symposium 2017 Chris van Aart
VU ICT4D symposium 2017 Chris van AartVU ICT4D symposium 2017 Chris van Aart
VU ICT4D symposium 2017 Chris van AartVictor de Boer
 
VU ICT4D symposium 2017 Gayo Diallo Towards a Digital African Traditional Hea...
VU ICT4D symposium 2017 Gayo Diallo Towards a Digital African Traditional Hea...VU ICT4D symposium 2017 Gayo Diallo Towards a Digital African Traditional Hea...
VU ICT4D symposium 2017 Gayo Diallo Towards a Digital African Traditional Hea...Victor de Boer
 
VU ICT4D symposium 2017 Wendelien Tuyp: Boosting african agriculture
VU ICT4D symposium 2017 Wendelien Tuyp: Boosting african agriculture VU ICT4D symposium 2017 Wendelien Tuyp: Boosting african agriculture
VU ICT4D symposium 2017 Wendelien Tuyp: Boosting african agriculture Victor de Boer
 

Mais de Victor de Boer (20)

One day workshop Linked Data and Semantic Web
One day workshop Linked Data and Semantic WebOne day workshop Linked Data and Semantic Web
One day workshop Linked Data and Semantic Web
 
Linked Data for Digital Humanities research at Media Archives
Linked Data for Digital Humanities research at Media ArchivesLinked Data for Digital Humanities research at Media Archives
Linked Data for Digital Humanities research at Media Archives
 
The Benefits of Linking Metadata for Internal and External users of an Audiov...
The Benefits of Linking Metadata for Internal and External users of an Audiov...The Benefits of Linking Metadata for Internal and External users of an Audiov...
The Benefits of Linking Metadata for Internal and External users of an Audiov...
 
UX Challenges of Information Organisation: Assessment of Language Impairment ...
UX Challenges of Information Organisation: Assessment of Language Impairment ...UX Challenges of Information Organisation: Assessment of Language Impairment ...
UX Challenges of Information Organisation: Assessment of Language Impairment ...
 
Interactive Dance Choreography Assistance presentation for ACE entertainment ...
Interactive Dance Choreography Assistance presentation for ACE entertainment ...Interactive Dance Choreography Assistance presentation for ACE entertainment ...
Interactive Dance Choreography Assistance presentation for ACE entertainment ...
 
Fahad Ali's slides for Machine to-machine communication in rural conditions ...
Fahad Ali's slides for Machine to-machine communication in rural conditions  ...Fahad Ali's slides for Machine to-machine communication in rural conditions  ...
Fahad Ali's slides for Machine to-machine communication in rural conditions ...
 
Linking African Traditional Medicine Knowledge - by Gossa Lo
Linking African Traditional Medicine Knowledge - by Gossa LoLinking African Traditional Medicine Knowledge - by Gossa Lo
Web technology: Web search

  • 1. Search on the Web Victor de Boer Web Technology 2015 Slides adapted from Willem Robert van Hage
  • 2. Overview • Search engines: – What do they do – How do they work? – How good are they? How to evaluate? • Discover information laws by counting words
  • 3. How does a search engine work? • What is a search engine?
  • 5. “Bank in Amsterdam” Classic Information Retrieval model
  • 6. “Bank in Amsterdam” Classic Information Retrieval model
  • 7. “Bank in Amsterdam” Classic Information Retrieval model
  • 8. How does a search engine work? • How does a search engine know a document matches your question? • Words have different meaning, does the search engine know which one you need ?
  • 9. How does a search engine work? In fact, most search engines do not know what you mean; they just make a guess. If they read “bank”, they do not know whether you mean a river bank or a financial institution. They usually return the pages that make the majority of users happy.
  • 10. How does a search engine work? • So if you enter “bank”, the search engine does not necessarily know what you mean. • But what if you enter “bank transfer”?
  • 11. How does a search engine work? Then the search engine still does not “know” what you mean, but will just return pages that mention both “bank” and “transfer”. If these correspond with what you meant, is that a “mere coincidence”? Not entirely, because the word “transfer” in combination with “bank” makes the query more informative than either word on its own. (Boolean search, ad-hoc query)
  • 12. Not only ad-hoc queries • What if you do not know what to enter as a search term (or do not want to)?
  • 13. How does a search engine work? Alternative search strategies: • Browsing (Wikipedia, Yahoo! Directory) • Social bookmarking (Digg, del.icio.us) • Recommender systems (StumbleUpon, Amazon)
  • 14. How does a search engine work? • How can a search return documents from all over the web in less than a quarter of a second?
  • 15. How does a search engine work? Indexing (more later) Multiple servers in parallel Pre-selection based on time/origin/query
  • 16. How does a search engine work? • Does a search engine look up the results live on the Web? (Photo: flickr/photophilde)
  • 17. How does a search engine work? • Does a search engine maintain a copy of each document you can search for?
  • 18. How does a search engine work? No, the engine uses a kind of locally stored summary of each page. Not all pages are included; duplicates and junk are thrown away.
  • 20. Crawling • How does a search engine know your site exists? Search engines follow the links on pages they already know, so if someone else links to your site, the engines will find you sooner or later. This process is called “crawling”.
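The link-following idea behind crawling can be sketched as a breadth-first traversal. This is a minimal illustration, not how any particular engine does it: the `LINKS` dictionary and all page names are hypothetical and stand in for actually fetching pages and parsing their links.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
# In a real crawler these links would come from fetching and parsing HTML.
LINKS = {
    "known.example/home": ["known.example/about", "yoursite.example/"],
    "known.example/about": ["known.example/home"],
    "yoursite.example/": ["yoursite.example/blog"],
    "yoursite.example/blog": [],
    "island.example/": ["island.example/page"],  # no known page links here
    "island.example/page": [],
}


def crawl(seeds, links):
    """Breadth-first traversal: follow links from pages that are already known."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:  # visit each page only once
                seen.add(target)
                queue.append(target)
    return seen


found = crawl(["known.example/home"], LINKS)
for page in sorted(found):
    print(page)
```

In this toy graph the crawler reaches `yoursite.example/` because a known page links to it, while the `island.example` pages stay undiscovered because nothing links to them.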
  • 23. How does a search engine work? • Can you crawl the entire web? • How big is the web anyway?
  • 24. Hubs Almost. The web has the nice property that there are very few pages that link to many others and a lot of pages that link to very few other pages.
  • 25. Deep Web In addition, there is the “Deep Web”, the part of the web that is not linked to with a fixed URL (for example, data in a database). Most of the “Deep Web” is not crawled at all.
  • 26. How Big is the Web? (Source: http://www.factshunt.com/2014/01/total-number-of-websites-size-of.html) • 759 million: total number of websites on the Web • 510 million: total number of live (active) websites • 14.3 trillion: webpages live on the Internet • 48 billion: webpages indexed by Google Inc. • 14 billion: webpages indexed by Microsoft's Bing
  • 27. Third site on the Web: Nikhef, the Dutch National Institute for Subatomic Physics (Nederlands instituut voor subatomaire fysica).
  • 29. Preprocessing 1. Remove HTML tags 2. Tokenization (“I am walking.” -> [I, am, walking]) 3. Remove stop words (the, I, it, …) 4. Stemming (cars, car -> car; walking, walks -> walk) Result: for each doc, a list of terms
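The four preprocessing steps can be sketched as a small pipeline. This is only an illustration under stated assumptions: the stop-word list is a tiny made-up sample, the tokenizer also lower-cases, and `stem` is a crude suffix stripper written for this example rather than a real stemming algorithm such as Porter's.

```python
import re
from html import unescape

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "i", "it", "am"}


def strip_tags(html):
    """Step 1: remove HTML tags with a simple regex (adequate for this sketch only)."""
    return unescape(re.sub(r"<[^>]+>", " ", html))


def tokenize(text):
    """Step 2: lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Step 4: crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(html):
    """Run all four steps and return the list of terms for one document."""
    tokens = tokenize(strip_tags(html))
    return [stem(t) for t in tokens if t not in STOP_WORDS]  # step 3 + step 4


terms = preprocess("<p>I am <b>walking</b>. The cars, the car; it walks.</p>")
print(terms)
```

The result is one list of terms per document, which is the input the indexing steps work on.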
  • 33. Term-document incidence: 1 if play contains word, 0 otherwise. • So we have a 0/1 vector for each term. • To answer the query “Brutus AND Caesar BUT NOT Calpurnia”: take the vectors for Brutus, Caesar and Calpurnia (complemented), then apply a bitwise AND. • 110100 AND 110111 AND 101111 = 100100.
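The bitwise query evaluation can be reproduced with Python integers used as bit vectors. A minimal sketch: the Calpurnia vector `010000` is inferred here as the complement of the `101111` shown on the slide, and documents are simply numbered 1 to 6 from left to right.

```python
N_DOCS = 6
MASK = (1 << N_DOCS) - 1  # 0b111111, keeps complements within 6 bits

# One 0/1 incidence vector per term, leftmost bit = document 1.
incidence = {
    "Brutus": 0b110100,
    "Caesar": 0b110111,
    "Calpurnia": 0b010000,  # inferred: complement of the slide's 101111
}


def complement(vector):
    """Flip every bit, restricted to the N_DOCS document positions."""
    return ~vector & MASK


# Brutus AND Caesar AND NOT Calpurnia
result = incidence["Brutus"] & incidence["Caesar"] & complement(incidence["Calpurnia"])

bits = format(result, f"0{N_DOCS}b")
matching_docs = [i + 1 for i, bit in enumerate(bits) if bit == "1"]
print(bits, matching_docs)
```

The set bits of the result identify the documents that satisfy the whole Boolean query.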
  • 34. But? Bigger collections • Consider 1 million documents, each with about 1000 words. • Avg 6 bytes/word including spaces/punctuation – 6 GB of data in the documents. • Say there are M = 500K distinct terms among these. • A 500K x 1M matrix has half a trillion (500,000,000,000) 0’s and 1’s. • But it has no more than one billion (1,000,000,000) 1’s – the matrix is extremely sparse: at most 1 cell in 500 is a 1. • What’s a better representation? – We only record the 1 positions.
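The numbers on this slide follow from a few multiplications, worked through below as a quick check. The variable names are chosen for this sketch; the upper bound on the number of 1's uses the fact that each word occurrence can set at most one cell.

```python
n_docs = 1_000_000        # documents in the collection
words_per_doc = 1_000     # about 1000 words each
bytes_per_word = 6        # including spaces/punctuation
n_terms = 500_000         # distinct terms (M)

total_tokens = n_docs * words_per_doc         # word occurrences in the collection
corpus_bytes = total_tokens * bytes_per_word  # raw size of the documents
matrix_cells = n_terms * n_docs               # size of the term-document matrix
max_ones = total_tokens                       # each occurrence sets at most one cell
max_density = max_ones / matrix_cells         # upper bound on the fraction of 1's

print(f"corpus size : {corpus_bytes / 1e9:.0f} GB")
print(f"matrix cells: {matrix_cells:,}")
print(f"max 1's     : {max_ones:,}")
print(f"max density : {max_density:.3f}")
```

With at least 99.8% of the cells equal to zero, storing only the positions of the 1's is far cheaper than storing the full matrix.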
  • 36. Inverted index • For each term t, we must store a list of all documents that contain t. – Identify each by a docID, a document serial number. Dictionary -> postings (sorted by docID): Brutus -> 1, 2, 4, 11, 31, 45, 173, 174 • Caesar -> 1, 2, 4, 5, 6, 16, 57, 132 • Calpurnia -> 2, 31, 54, 101
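Because postings are sorted by docID, an AND query can be answered by walking two lists in parallel. The sketch below uses the postings from this slide as best they can be read from the figure; the function name `intersect` is chosen for this example.

```python
# Dictionary -> postings lists, each sorted by docID.
postings = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}


def intersect(p1, p2):
    """Merge-style intersection of two sorted postings lists."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:      # docID in both lists: keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:     # advance the pointer at the smaller docID
            i += 1
        else:
            j += 1
    return answer


brutus_and_caesar = intersect(postings["Brutus"], postings["Caesar"])
all_three = intersect(brutus_and_caesar, postings["Calpurnia"])
print(brutus_and_caesar, all_three)
```

Each list is scanned at most once, so the work grows with the combined length of the two lists rather than with the size of the collection.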
  • 37. Inverted index construction: Documents to be indexed (“Friends, Romans, countrymen.”) -> Tokenizer -> Token stream (Friends, Romans, Countrymen) -> Linguistic modules -> Modified tokens (friend, roman, countryman) -> Indexer -> Inverted index (friend -> 2, 4; roman -> 1, 2; countryman -> 13, 16)
  • 38. Indexer steps: Token sequence • Sequence of (Modified token, Document ID) pairs. Doc 1: “I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.” Doc 2: “So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious”
  • 39. Indexer steps: Sort • Sort by terms – and then by docID. This is the core indexing step.
  • 40. Indexer steps: Dictionary & Postings • Multiple term entries in a single document are merged. • Split into Dictionary and Postings • Doc. frequency information is added.
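The three indexer steps can be sketched on the two example documents. This is a minimal illustration: the only "linguistic" modification assumed here is lower-casing, and the tokenizer is a simple regex chosen for this example.

```python
import re
from itertools import groupby

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: sequence of (modified token, docID) pairs.
pairs = [
    (token, doc_id)
    for doc_id, text in docs.items()
    for token in re.findall(r"[a-z]+", text.lower())
]

# Step 2: sort by term, then by docID (tuples sort in exactly that order).
pairs.sort()

# Step 3: merge duplicate entries and split into dictionary and postings.
index = {}
for term, group in groupby(pairs, key=lambda pair: pair[0]):
    index[term] = sorted({doc_id for _, doc_id in group})

# Document frequency = number of documents that contain the term.
doc_freq = {term: len(plist) for term, plist in index.items()}

for term in ("brutus", "caesar", "killed", "ambitious"):
    print(term, doc_freq[term], index[term])
```

Note that "killed" occurs twice in Doc 1 and "caesar" twice in Doc 2, yet each document appears only once in the corresponding postings list.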
  • 41. Index size • How big can your index be on a single machine? • Consider an uncompressed index of one year of Reuters news messages: does that fit in main memory? • How big do an index and dictionary become?
  • 42. Reuters RCV1 statistics (statistic: value) • documents: 800,000 • avg. # tokens per doc: 200 • terms (= word types): 400,000 • avg. # bytes per token: 4.5 (without spaces/punct.) • avg. # bytes per term: 7.5 • postings: 100,000,000
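A rough back-of-the-envelope estimate can be made from these statistics. The sketch below assumes 4 bytes per docID in the postings, which is an assumption of this example and not a figure from the slides, and it ignores pointers and other per-entry overhead.

```python
# Figures from the RCV1 statistics above.
n_docs = 800_000
tokens_per_doc = 200
n_terms = 400_000
bytes_per_token = 4.5   # without spaces/punctuation
bytes_per_term = 7.5
n_postings = 100_000_000

DOC_ID_BYTES = 4        # assumption: each posting stores a 4-byte docID

total_tokens = n_docs * tokens_per_doc        # word occurrences in the collection
text_bytes = total_tokens * bytes_per_token   # raw token text
dictionary_bytes = n_terms * bytes_per_term   # term strings only
postings_bytes = n_postings * DOC_ID_BYTES    # uncompressed postings

print(f"tokens     : {total_tokens:,}")
print(f"token text : {text_bytes / 1e6:.0f} MB")
print(f"dictionary : {dictionary_bytes / 1e6:.0f} MB")
print(f"postings   : {postings_bytes / 1e6:.0f} MB")
```

Under these assumptions the postings dominate at a few hundred megabytes, while the term strings take only a few megabytes.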
  • 43. How well does a search engine work? Measure it! Select a representative set of queries (e.g. from a server log). Ask a representative set of human raters to “judge” the relevance of all the search results. Check whether one engine (e.g. Google) is better than the other (e.g. Yahoo!) by counting whether it returns more relevant pages and fewer non-relevant ones (the whole truth / nothing but the truth). For how many questions is this the case? Is this more than you would expect by pure chance?
  • 44. How does a search engine work?
  • 45. Tradeoff (figure annotation: “better system”). F-measure is the harmonic mean of precision and recall: F = 2 · P · R / (P + R)
  • 46. Google eye-tracking (source: agent-seo.com, Cornell University Eye-Tracking Study Data)
  • 47. Clicks (source: agent-seo.com, Cornell University Eye-Tracking Study Data)
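Precision, recall and their harmonic mean can be computed directly from a set of returned documents and a set of documents judged relevant. A minimal sketch; the document identifiers and the judgments are made up for illustration.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-measure) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were returned
    precision = hits / len(retrieved) if retrieved else 0.0  # "nothing but the truth"
    recall = hits / len(relevant) if relevant else 0.0       # "the whole truth"
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure


# Hypothetical query: 4 documents returned, 6 documents judged relevant.
p, r, f = precision_recall_f(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5", "d6", "d7", "d8"],
)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

The harmonic mean stays low unless both values are high, so a system cannot score well by returning everything (high recall, low precision) or almost nothing (high precision, low recall).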
  • 49. Precision at N • When the number of results grows large, the precision over the entire result set may matter less than the precision over the first N results. • Precision at N (P@N) • Example: P@1 = 1.0, P@5 = 0.6 (Plot: P@N and R@N against N, for N from 0 to 30, with values between 0 and 1.)
  • 50. Ranking State-of-the-art search engines use all kinds of tricks for ranking. Let's think of a few …
  • 51. Example weighting scheme: tf.idf (Term Frequency, Inverse Document Frequency). Every word is assigned a weight for a document. Some words are more important than others. One version:
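Precision at N only looks at the top of the ranked list. In the sketch below the relevance judgments are hypothetical, chosen so that they reproduce the P@1 and P@5 values mentioned on the slide.

```python
def precision_at(n, judgments):
    """Fraction of the first n ranked results that were judged relevant."""
    top = judgments[:n]
    return sum(top) / len(top) if top else 0.0


# Hypothetical relevance judgments for a ranked result list (rank 1 first).
judgments = [True, False, True, True, False]

for n in (1, 3, 5):
    print(f"P@{n} = {precision_at(n, judgments):.2f}")
```

Only the first N judgments are needed, which also makes this measure cheaper to obtain than precision over the full result set.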
  • 52. TFIDF Example (term: term count) • Doc1: this: 1, is: 1, a: 2, sample: 1 • Doc2: this: 1, is: 1, another: 2, example: 3
  • 53. Why the “log”? • How often does the most common word appear in a corpus? How often does the second most common word appear? And so on. – Split the books into words, cut them up on the spaces and punctuation – Delete all punctuation – Sort all words – Count the words – Plot the counts
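The two example documents can be scored with one common tf.idf variant. The exact formula used on the slide is not reproduced in this transcript, so the choice below (term count divided by document length, times the base-10 logarithm of N over document frequency) is an assumption of this sketch.

```python
import math

# Term counts from the two example documents.
doc1 = {"this": 1, "is": 1, "a": 2, "sample": 1}
doc2 = {"this": 1, "is": 1, "another": 2, "example": 3}
corpus = [doc1, doc2]


def tf(term, doc):
    """Term frequency: count of the term relative to the document length."""
    return doc.get(term, 0) / sum(doc.values())


def idf(term, corpus):
    """Inverse document frequency: log10(N / number of docs containing the term)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / df) if df else 0.0


def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


for term in ("this", "example"):
    for name, doc in (("Doc1", doc1), ("Doc2", doc2)):
        print(f"{term:8s} {name}: {tf_idf(term, doc, corpus):.3f}")
```

The word “this” occurs in every document, so its idf is log(1) = 0 and it gets no weight, whereas “example” occurs only in Doc2 and gets a positive weight there.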
  • 54. Zipf’s law The most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. Formally: the frequency of a word is inversely proportional to its rank in the frequency table. wugology.com
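The counting experiment behind Zipf's law takes only a few lines. A minimal sketch on a made-up sentence; a real experiment would use whole books, and a text this short can only show the mechanics, not the law itself.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Split into lower-cased words, drop punctuation, count, sort by frequency."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common()


sample = "the cat and the dog and the bird saw the cat"
table = rank_frequency(sample)

# If frequency is inversely proportional to rank, rank * count stays roughly constant.
for rank, (word, count) in enumerate(table, start=1):
    print(rank, word, count, rank * count)
```

Plotting count against rank on log-log axes turns an inverse-proportional relation into an approximately straight line, which is one reason logarithms show up in term weighting.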
  • 56. But! Heaps’ Law • Split the books into words, cut them up on the spaces and punctuation • Delete all punctuation • Do not sort words • Go over all words and count the number of unique words you have seen • Plot the results linearly.
  • 57. Heaps’ law • How fast does the dictionary grow?
  • 58. Heaps’ law Informally: by scanning the text we will hit upon the most common words rather quickly, but we will continue, at an ever slower rate, to encounter new (infrequent) words.
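The vocabulary-growth experiment can be sketched by counting the unique words seen so far while reading the text in its original order. The sample sentence is made up; Heaps' law is usually stated as V(n) = K · n^β with β below 1, and a text this short only illustrates the mechanics.

```python
import re


def vocabulary_growth(text):
    """For each word position, record how many unique words have been seen so far."""
    seen = set()
    growth = []
    for word in re.findall(r"[a-z]+", text.lower()):
        seen.add(word)            # repeated words do not grow the vocabulary
        growth.append(len(seen))
    return growth


growth = vocabulary_growth("the cat and the dog and the bird saw the cat")
print(growth)
```

Plotting `growth` against the word position gives a curve that keeps rising but flattens out as more and more of the words encountered have been seen before.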
  • 59. Other Ranking tricks • Localisation (language, but also your mobile location) • Personalisation • Log analysis • PageRank
  • 60. PageRank (Page and Brin) • Absolute score for a page • Intuition: pages that are linked to by important pages are themselves important, i.e. the PageRank value for a page u depends on the PageRank value of each page v in the set Bu (the set of all pages linking to page u), divided by the number L(v) of links from page v. http://en.wikipedia.org/wiki/PageRank
  • 61. So... • Web search is a form of information retrieval with the Web as corpus • Inverted indexes are built using crawling, processing and indexing • A Boolean query is then matched to the index, returning pages that match • How well a search engine works depends on user judgement – Precision, Recall and F-measure • Ranking is key – especially in Web search – There are many strategies for ranking, and being good at ranking can make you very rich
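The intuition can be turned into a small iterative computation. This sketch adds a damping factor of 0.85, which is commonly used but not mentioned on the slide, runs on a made-up three-page graph, and assumes that every page has at least one outgoing link.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively compute PageRank for a dict mapping page -> list of linked pages.

    Assumes every page has at least one outgoing link (no dangling pages).
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform score
    # B_u: the set of pages v that link to page u.
    incoming = {u: [v for v in pages if u in links[v]] for u in pages}
    for _ in range(iterations):
        rank = {
            u: (1 - damping) / n
            + damping * sum(rank[v] / len(links[v]) for v in incoming[u])
            for u in pages
        }
    return rank


# Hypothetical link graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

Page C ends up with the highest score because both other pages link to it, and A scores higher than B because it receives all of C's rank while B receives only half of A's.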
  • 62. Oh, and optimizing for Google’s ranking can make you a bit rich, and a bit cool https://www.youtube.com/watch?v=fnSJBpB_OKQ

Editor's Notes

  1. Essentially, a user, driven by an information need, constructs a query in some query language. The query is submitted to a system that selects from a collection of documents (corpus), those documents that match the query as indicated by certain matching rules. A query refinement process might be used to create new queries and/or to refine the results. (Figure 1)
  2. First website: http://info.cern.ch/hypertext/WWW/TheProject.html Nikhef, the Dutch National Institute for Subatomic Physics (Nederlands instituut voor subatomaire fysica), was the third!
  3. 6%
  4. TODO: add n to the right-hand graph. How many people click through?
  5. TODO: also on non-log paper