Introduction to
Information Retrieval
Carsten Eickhoff
What is it all about?
1 Introduction
• Google and other Web search providers have been
on the rise for more than 10 years
• They are increasingly taking on the role of
“media all-rounders”
• On Coursera, you have been learning about
the range of products offered by Google
• In this lecture, we will try to understand how
these services work
IR Applications
• Many on-line activities involve IR technology
– Media consumption (music, videos, text, …)
– News tracking
– Social networking
– Online shopping
– Advertisement
– Mobile communication
Agenda
1. Introduction
2. IR Technology
   a) Crawling
   b) Indexing & Storage
   c) Retrieval
   d) Data-driven Technologies
3. IR for Children
4. Recommended Reading
Section 2

IR TECHNOLOGY
Searching and Term Matching

• Search engines do not understand queries!
• They only find things that are similar
Overview Web Search
[Diagram with the components: Web, Documents, Index, Query, Result Set]
Section 2.1

CRAWLING
How does the Engine know
what is out there?
Spiders, Crawlers, Robots…
• Search engine companies send hordes of lightweight programs through the web
• Whenever they encounter a new or updated
Web page, they save it.
• From there, they follow the page’s outgoing
links to discover new content
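The save-and-follow loop described above can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: the `TOY_WEB` dictionary, its page names, and the `crawl` function are hypothetical stand-ins for real HTTP fetching and link extraction.

```python
from collections import deque

# Hypothetical toy Web: page -> outgoing links (replaces real HTTP fetches).
TOY_WEB = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "d.html"],
    "c.html": ["a.html"],
    "d.html": [],
}

def crawl(seed, web):
    """Visit pages breadth-first, saving each page once and following its links."""
    saved = []                 # pages in the order they were stored
    seen = {seed}              # pages already queued, so nothing is visited twice
    frontier = deque([seed])   # pages waiting to be visited
    while frontier:
        page = frontier.popleft()
        saved.append(page)                 # "save" the page
        for link in web.get(page, []):     # follow its outgoing links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return saved

print(crawl("a.html", TOY_WEB))
```

A production crawler would also re-visit pages over time and throttle requests per server; neither is modelled here.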
The Crawling Process
Crawling the Web
• Web crawling is an ongoing process
• To keep up with the pace at which information
on the Web changes, pages are re-crawled
continuously
• Search engine providers have to make sure
that their crawlers are polite (i.e., do not place
a high or peaking load on web servers)
The Ethics of Saving Everything…
• A large share of the Web is private or
proprietary in nature
• Crawlers should not visit such pages
• A “robots.txt” file specifies simple access rules for
each Web server
Access Policy Examples
• Examples of robot policies in robots.txt
Example 1:
User-agent: *
Disallow: /cyberworld/map/
User-agent: cybermapper
Disallow:

Example 2:
User-agent: *
Disallow: /
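To see how a crawler might interpret the two policies above, the sketch below feeds them to Python's standard `urllib.robotparser` module. The crawler name `somebot`, the tested paths, and the helper `may_fetch` are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# The two example policies from the slide.
EXAMPLE_1 = """\
User-agent: *
Disallow: /cyberworld/map/
User-agent: cybermapper
Disallow:
"""

EXAMPLE_2 = """\
User-agent: *
Disallow: /
"""

def may_fetch(robots_txt, agent, path):
    """Parse a robots.txt body and ask whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

# Example 1: generic crawlers are kept out of /cyberworld/map/ only,
# while "cybermapper" has an empty Disallow and may go anywhere.
print(may_fetch(EXAMPLE_1, "somebot", "/cyberworld/map/index.html"))
print(may_fetch(EXAMPLE_1, "somebot", "/index.html"))
print(may_fetch(EXAMPLE_1, "cybermapper", "/cyberworld/map/index.html"))

# Example 2: every crawler is excluded from the whole server.
print(may_fetch(EXAMPLE_2, "somebot", "/index.html"))
```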
The Hidden Web
The Deep Web
• Similar to the hidden Web, pages that only
become available after user interaction
• E.g., product pages in a Web shop
• Some crawlers can fill in information in forms
and explore the underlying pages
• But be careful, or your crawler buys a cruise
ship tour…
Section 2.2

INDEXING & STORAGE
Index Construction
• Naïve Approach:
– Make a copy of each page
– When a query comes in, look for the terms in all
saved pages

• But wait, is that clever?
– We need to answer queries in half a second…
Inverted Index
• Turn the problem around
• Build an inverted index that tells us, for each
term, which documents contain it
• At query time, we only consider those
documents that contain at least one query
term
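The query-time step above can be sketched as a union of per-term document sets. The tiny `INDEX` below is a hypothetical excerpt (terms and document IDs chosen to match the ink/pink/wink example in these slides), and `candidates` is an illustrative helper name.

```python
# Hypothetical toy index: term -> set of IDs of documents containing it.
INDEX = {
    "ink": {"D3", "D4", "D5"},
    "pink": {"D4", "D5"},
    "wink": {"D1", "D5"},
}

def candidates(query, index):
    """Return the documents that contain at least one query term."""
    docs = set()
    for term in query.lower().split():
        docs |= index.get(term, set())   # unknown terms contribute nothing
    return docs

print(sorted(candidates("pink wink", INDEX)))
```

Only the documents returned here need to be scored; all others are skipped without being read.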
Example: Inverted Index
     he  drink  ink  likes  pink  thing  wink
D1    2      1    0      2     0      0     1   He likes to wink, he likes to drink.
D2    1      3    0      1     0      0     0   He likes to drink, and drink, and drink.
D3    1      1    1      1     0      1     0   The thing he likes to drink is ink.
D4    1      1    1      1     1      0     0   The ink he likes to drink is pink.
D5    1      1    1      1     1      0     1   He likes to wink and drink pink ink.
Example: Sparse Index
• Most fields of the matrix would be empty
• So we store it as a sparse matrix
he     D1:2  D2:1  D3:1  D4:1  D5:1
ink    D3:1  D4:1  D5:1
pink   D4:1  D5:1
thing  D3:1
wink   D1:1  D5:1
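A sparse index of this kind can be built in a few lines. The sketch below uses the five example sentences with the document IDs D1 to D5; the tokenizer (lowercase, letters only) and the function names are simplifying assumptions.

```python
import re
from collections import Counter, defaultdict

DOCS = {
    "D1": "He likes to wink, he likes to drink.",
    "D2": "He likes to drink, and drink, and drink.",
    "D3": "The thing he likes to drink is ink.",
    "D4": "The ink he likes to drink is pink.",
    "D5": "He likes to wink and drink pink ink.",
}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Map each term to {doc_id: term frequency}, storing non-zero entries only."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index

index = build_index(DOCS)
print(index["he"])
print(index["wink"])
```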
Positional Indices
• The inverted index allows us to know which
terms are in which documents
• But we lose position information
• Term proximity may be important
Query: “Square dance”
Doc 1: “…Times Square in New York is often host to dance performances...”
Doc 2: “… Square dance is a dance for four couples…”
Example: Positional Index
• To preserve position information, we store
each term occurrence as a tuple <doc,pos>
he     D1:1  D1:5  D2:1  D3:3  D4:3  D5:1
ink    D3:8  D4:2  D5:8
pink   D4:8  D5:7
thing  D3:2
wink   D1:4  D5:4
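With positions stored, a phrase query such as “Square dance” can require the two terms to be adjacent. The sketch below uses the visible fragments of the two “Square dance” documents; the tokenizer and the function names are assumptions, and only two-word phrases are handled.

```python
import re
from collections import defaultdict

# Visible fragments of the two example documents.
DOCS = {
    "D1": "Times Square in New York is often host to dance performances",
    "D2": "Square dance is a dance for four couples",
}

def build_positional_index(docs):
    """Map each term to a list of (doc_id, position) tuples, positions from 1."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        for pos, term in enumerate(tokens, start=1):
            index[term].append((doc_id, pos))
    return index

def phrase_match(index, first, second):
    """Return documents in which `second` directly follows `first`."""
    second_hits = set(index.get(second, []))
    return sorted({doc for doc, pos in index.get(first, [])
                   if (doc, pos + 1) in second_hits})

index = build_positional_index(DOCS)
print(phrase_match(index, "square", "dance"))
```

Both documents contain both terms, but only the second has them next to each other.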
What if the index becomes too large?
Distributed Indices
• If we cannot fit the index on a single machine
any more, we split it up
• Such distributed architectures introduce new
problems
– Crawling
– Indexing
– Retrieval
Map Reduce
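The slide only names MapReduce; one common way to apply it to indexing is to let a map step emit (term, document) pairs per shard and a reduce step group them into posting lists. The single-process sketch below imitates that flow; the shard contents and the function names are hypothetical.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per token in the document."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(pairs):
    """Group the pairs by term and count occurrences per document."""
    postings = defaultdict(lambda: defaultdict(int))
    for term, doc_id in pairs:
        postings[term][doc_id] += 1
    return {term: dict(docs) for term, docs in postings.items()}

# Two hypothetical shards, each holding part of the collection.
SHARDS = [
    {"D1": "he likes to wink"},
    {"D2": "he likes to drink and drink"},
]

pairs = []
for shard in SHARDS:                       # on a cluster, shards run in parallel
    for doc_id, text in shard.items():
        pairs.extend(map_phase(doc_id, text))

index = reduce_phase(pairs)
print(index["he"])
print(index["drink"])
```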
Section 2.3

RETRIEVAL
Boolean Retrieval
• The early days (1960s to 1980s)
• Searching in highly specialized collections
– Legal or medical documents
– Newspaper archives

• Searcher is a trained professional
Boolean Queries
• Complex queries in expert syntax describe the
information need (e.g., Westlaw)
Information need:
Information on the legal theories involved in preventing the disclosure of
trade secrets by employees formerly employed by a competing company.
Query:
"trade secret" /s disclos! /s prevent /s employe!
Information need:
Requirements for disabled people to be able to access a workplace.
Query:
disab! /p access! /s work-site work-place (employment /3 place)
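At its core, Boolean retrieval combines sets of matching documents with AND, OR and NOT. The sketch below shows only these three operators on Python sets; it does not attempt Westlaw's proximity or truncation operators, and the `POSTINGS` data is hypothetical.

```python
# Hypothetical postings: term -> set of matching document IDs.
POSTINGS = {
    "trade": {1, 2, 4},
    "secret": {1, 4, 5},
    "employee": {2, 4},
    "patent": {4, 6},
}

def docs(term):
    """Return the set of documents containing `term` (empty if unknown)."""
    return POSTINGS.get(term, set())

# trade AND secret AND NOT patent
and_not = (docs("trade") & docs("secret")) - docs("patent")
print(sorted(and_not))

# employee OR patent
either = docs("employee") | docs("patent")
print(sorted(either))
```

The result is an unordered set: a document either matches or it does not.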
Boolean Retrieval Summary
+
• Intuitive
• Gives searcher direct control over the retrieval process
-
• High cognitive load during query formulation and
result set exploration
• No ordering to result sets
• Requires expert syntax
Ranked Retrieval
• Introduced in the early 1990s
• Returns documents in their order of relevance
• New problem: How to determine relevance?
• Retrieval model computes one relevance score
per document and sorts accordingly
Vector Space Models
• Geometric model
• Idea:
– Relevant documents are similar to the query
– The more similar they are, the higher they are
ranked
Vector Space Mappings
• Translate text documents into vectors
Document Text:
“IT WAS the best of times, it was the worst of times…”
                         it  was  the  best  of  times  worst  cat
Binary Vector:            1    1    1     1   1      1      1    0
Frequency-based Vector:   2    2    2     1   2      2      1    0
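Both mappings can be reproduced with a few lines of Python. The vocabulary order follows the slide; the tokenizer and the function names are assumptions for illustration.

```python
import re
from collections import Counter

VOCAB = ["it", "was", "the", "best", "of", "times", "worst", "cat"]
TEXT = "IT WAS the best of times, it was the worst of times"

def frequency_vector(text, vocab):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in vocab]

def binary_vector(text, vocab):
    """Record only whether each vocabulary term occurs at all."""
    return [1 if count > 0 else 0 for count in frequency_vector(text, vocab)]

print(binary_vector(TEXT, VOCAB))
print(frequency_vector(TEXT, VOCAB))
```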
Documents in Space
• Each vector is a point in a high-dimensional
space
• We tend to stop drawing at 3 dimensions, but
10,000 are also no problem mathematically ;-)
• Recall: Our document index already looks like
a vector!
Distances in Vector Spaces
• Similar documents lie close together
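[Figure: the points (1,2), (1,1) and (2,0) plotted on axes labelled cat and dog;
δ = 1 between (1,2) and (1,1), δ = 1.41 between (1,1) and (2,0)]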
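The two distances in the figure are plain Euclidean distances and can be checked directly. The function name `euclidean` is chosen for this sketch.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean((1, 2), (1, 1)))             # the closer pair
print(round(euclidean((1, 1), (2, 0)), 2))   # the pair further apart
```

The same function works unchanged for vectors with thousands of dimensions.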
Probabilistic Models
• Alternatively, use probabilistic methods
• For each document, compute the probability
of relevance towards the query
• Rank by probability
Building Document Models
• Build a language model of each document
• Count term frequencies and divide by
document length
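[Figure: per-document term distributions over the terms he, drink, ink, likes,
pink, thing, wink, the, to, and, is for three example documents:]
“He likes to wink, he likes to drink.”
“The thing he likes to drink is ink.”
“He likes to wink and drink pink ink.”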
Ranking Documents
• Compute the likelihood of the query being
generated by the document models
• Order documents by decreasing probability
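The two steps (build a model per document, then score the query) can be sketched as follows. The three documents are the ones shown in the previous slide with their IDs D1, D3 and D5; the query “he wink” is a made-up example, and no smoothing is applied, so a document missing any query term scores zero.

```python
import re
from collections import Counter

DOCS = {
    "D1": "He likes to wink, he likes to drink.",
    "D3": "The thing he likes to drink is ink.",
    "D5": "He likes to wink and drink pink ink.",
}

def language_model(text):
    """Term frequency divided by document length, for every term in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return {term: n / len(tokens) for term, n in counts.items()}

def query_likelihood(query, model):
    """Probability that the document model generates every query term."""
    prob = 1.0
    for term in query.lower().split():
        prob *= model.get(term, 0.0)   # unseen term: probability 0 (unsmoothed)
    return prob

models = {doc_id: language_model(text) for doc_id, text in DOCS.items()}
query = "he wink"   # hypothetical query
ranking = sorted(models, key=lambda d: query_likelihood(query, models[d]),
                 reverse=True)
print(ranking)
```

Practical systems usually smooth the document models with collection statistics so that a single missing term does not zero out the score.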
PageRank
• Google’s fabled original ranking criterion

• Based on the idea that page authoritativeness
should be rewarded during ranking
• Authoritativeness scores are iteratively
propagated along hyperlinks
Example: PageRank
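The iterative propagation of scores can be sketched as a simple power iteration. The three-page link graph, the damping factor of 0.85 and the fixed number of iterations are assumptions for this illustration, not details taken from the slides, and the sketch assumes every page has at least one outgoing link.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Repeatedly pass each page's score along its outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}           # start uniform
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            share = rank[page] / len(outgoing)          # split score evenly
            for target in outgoing:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

ranks = pagerank(LINKS)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page C ends up with the highest score because it receives links from both other pages.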
Section 2.4

DATA-DRIVEN TECHNOLOGIES
Everyone wants your Data.
But what do they do with it?
Search Personalization
• People have very specific preferences and
interests (Topics, language, textual complexity,
location, etc.)
• Based on your previous search history, we can
find out about these preferences
• For future searches, the engine tries to
optimize for the newly found preference
Query Suggestion

• Query suggestions try to help people
formulate their information needs
• They are selected on the basis of frequent
queries that extend the current query terms
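A frequency-based suggester of this kind can be sketched with a counter over past queries. The `QUERY_LOG` below is hypothetical, and matching by string prefix is one simple reading of “queries that extend the current query terms”.

```python
from collections import Counter

# Hypothetical query log; a real engine would draw on far more past queries.
QUERY_LOG = [
    "information retrieval",
    "information retrieval book",
    "information retrieval book",
    "information theory",
    "information retrieval",
    "information retrieval",
    "inverted index",
]

def suggest(prefix, log, k=3):
    """Return the k most frequent logged queries that extend `prefix`."""
    counts = Counter(q for q in log if q.startswith(prefix) and q != prefix)
    return [query for query, _ in counts.most_common(k)]

print(suggest("information", QUERY_LOG))
```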
Spelling Correction
Data-driven Spelling Correction
• Idea: Consensus is strong / errors are random
Spelling            Frequency
albert einstein     4834
albert einstien     525
albert einstine     149
albert einsten      27
albert einsteins    25
albert einstain     11
albert einstin      10
albert eintein      9
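The consensus idea can be sketched as: among logged spellings close to what the user typed, pick the most frequent one. The frequencies are those from the table above; the edit-distance threshold of 2 and the function names are assumptions for this example.

```python
# Query frequencies taken from the table above.
FREQ = {
    "albert einstein": 4834,
    "albert einstien": 525,
    "albert einstine": 149,
    "albert einsten": 27,
    "albert einsteins": 25,
    "albert einstain": 11,
    "albert einstin": 10,
    "albert eintein": 9,
}

def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete from a
                            curr[j - 1] + 1,             # insert into a
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]

def correct(query, freq, max_dist=2):
    """Return the most frequent logged spelling within max_dist edits."""
    near = [q for q in freq if edit_distance(query, q) <= max_dist]
    return max(near, key=freq.get) if near else query

print(correct("albert einstien", FREQ))
```

Queries with no similar logged spelling are returned unchanged.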
Dangers: Welcome to the Filter Bubble
• Data-driven methods are very powerful
• But what happens to the niche information
need?
• What if I want to see that video that nobody
else likes?
• Diversification techniques can help but there
is a danger of drowning in the mainstream
Section 3

IR FOR CHILDREN
Meet the Users
The PuppyIR Project
• European Union research project on child-friendly information access
• Investigate the specific needs of children
• Create an open-source framework to cater for
these needs
• 5 Universities, 3 Business partners
The Internet – A Place full of Kids
• Children interact with the Internet at a young
age (~ 4 years old)
• They spend more time online
• ~ 40% of British 10-year-olds have regular
unsupervised Internet access
The Internet – A Place for Kids?
• Many media platforms are mirrors of popular
culture
• Large parts of their content might not be
suitable for children
• Most IR systems are designed with adult users
in mind, not children
How to learn what Children Need?
• Literature study
• Surveys
– 300 US parents and teachers

• User studies
– 49 Dutch elementary school children

• Query log analyses
– Thousands of young Yahoo users
Children’s Deficits
• They type / spell badly
• They browse rather than search
• They are bad at keywording
• They struggle with search interfaces
• Everything is relevant
• Complex content is a challenge
Web Page Classification
• Can we automatically determine whether a
web site is suitable for children?
– Based on its content
– Based on its link neighborhood
Video Classification
Content Simplification
• Sometimes we do not want to completely
exclude certain documents
• But they still contain complex language that is
hard to grasp for kids
• In these cases, we can
automatically offer
simplifications for
difficult/technical
terms
Demo: Museon/Gemeentemuseum
Demo: Emma Kinderziekenhuis
Section 4

RECOMMENDED READING
Introduction to IR
• Manning, Raghavan, Schütze
• Available online at: http://nlp.stanford.edu/IR-book/
• All-in-one introduction
Modern Information Retrieval
• Baeza-Yates, Ribeiro-Neto
• Good overview of
basic topics
Data Mining
• Witten, Frank
• Data mining and pattern
recognition essentials
Managing Gigabytes
• Witten, Moffat, Bell
• What to do when your data
gets large?
Speech and Language Processing
• Jurafsky, Martin
• Comprehensive overview
of speech and language
technology

 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 

Introduction to Information Retrieval

  • 2. What is it all about?
  • 3. 1 Introduction • Google and other Web search providers have been on the rise for more than 10 years • They are advancing to take on the role of “media all-rounders” • On Coursera, you have been learning about the range of products offered by Google • In this lecture, we will try to understand how these services are facilitated
  • 4. IR Applications • Many on-line activities involve IR technology – Media consumption (music, videos, text, …) – News tracking – Social networking – Online shopping – Advertisement – Mobile communication
  • 5. Agenda 1. Introduction 2. IR Technology a) Crawling b) Indexing & Storage c) Retrieval d) Data-driven Technologies 3. IR for Children 4. Recommended Reading
  • 7. Searching and Term Matching • Search engines do not understand queries! • They only find things that are similar
  • 10. How does the Engine know what is out there?
  • 12. Spiders, Crawlers, Robots… • Search engine companies send hordes of lightweight programs through the web • Whenever they encounter a new or updated Web page, they save it. • From there, they follow the page’s outgoing links to discover new content
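The save-then-follow-links behaviour on this slide can be sketched as a breadth-first traversal. This is only an illustration under simplifying assumptions, not how any production crawler works: `TOY_WEB` is a made-up in-memory link graph standing in for real HTTP fetches, and `crawl` is a hypothetical helper name.

```python
from collections import deque

# Hypothetical toy "Web": page name -> outgoing links (stands in for HTTP fetches).
TOY_WEB = {
    "home": ["news", "shop"],
    "news": ["home", "sports"],
    "shop": ["cart"],
    "sports": [],
    "cart": ["home"],
}

def crawl(seed, web):
    """Visit pages breadth-first, saving each page once and following its links."""
    frontier = deque([seed])   # pages discovered but not yet visited
    seen = {seed}              # prevents fetching the same page twice
    saved = []                 # order in which pages were stored
    while frontier:
        page = frontier.popleft()
        saved.append(page)
        for link in web.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return saved

print(crawl("home", TOY_WEB))
```

A real crawler would additionally re-visit pages over time and throttle its requests per server, which this sketch leaves out.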
  • 19. Crawling the Web • Web crawling is an ongoing process • To keep up with the pace at which information on the Web changes, pages are re-crawled continuously • Search engine providers have to make sure that their crawlers are polite (i.e., do not place a high or peaking load on web servers)
  • 20. The Ethics of Saving Everything… • Large shares of the Web are of private or proprietary nature • Crawlers should not visit those • “robots.txt” regulates simple access rules for each Web server
  • 21. Access Policy Examples • Examples of robot policies in robots.txt
      Example 1:
        User-agent: *
        Disallow: /cyberworld/map/
        User-agent: cybermapper
        Disallow:
      Example 2:
        User-agent: *
        Disallow: /
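To see how a crawler might interpret the two policies above, the following sketch feeds them to `urllib.robotparser` from Python's standard library. The host `example.com` and the crawler name `somebot` are placeholders chosen for illustration; `load_policy` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Example 1 from the slide: everyone is kept out of /cyberworld/map/,
# except the crawler "cybermapper", which may go anywhere.
EXAMPLE_1 = """\
User-agent: *
Disallow: /cyberworld/map/

User-agent: cybermapper
Disallow:
"""

# Example 2 from the slide: no crawler may visit anything.
EXAMPLE_2 = """\
User-agent: *
Disallow: /
"""

def load_policy(text):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

policy_1 = load_policy(EXAMPLE_1)
policy_2 = load_policy(EXAMPLE_2)
map_url = "http://example.com/cyberworld/map/index.html"  # placeholder URL

print(policy_1.can_fetch("somebot", map_url))        # generic crawler
print(policy_1.can_fetch("cybermapper", map_url))    # explicitly permitted crawler
print(policy_2.can_fetch("somebot", "http://example.com/"))
```

An empty `Disallow:` line means "nothing is disallowed", which is why the second record in Example 1 grants full access.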
  • 23. The Deep Web • Similar to the hidden Web, pages that only become available after user interaction • E.g., product pages in a Web shop • Some crawlers can fill in information in forms and explore the underlying pages • But be careful, or your crawler buys a cruise ship tour…
  • 26. Index Construction • Naïve Approach: – Make a copy of each page – When a query comes in, look for the terms in all saved pages • But wait, is that clever? – We need to answer queries in half a second…
  • 27. Inverted Index • Turn the problem around • Build an inverted index that tells us which document contains the query terms • At query time, we only consider those documents that contain at least one query term
  • 28. Example: Inverted Index
      he  drink  ink  likes  pink  thing  wink
      2   1      0    2      0     0      1     He likes to wink, he likes to drink.
      1   3      0    1      0     0      0     He likes to drink, and drink, and drink.
      1   1      1    1      0     1      0     The thing he likes to drink is ink.
      1   1      1    1      1     0      0     The ink he likes to drink is pink.
      1   1      1    1      1     0      1     He likes to wink and drink pink ink.
  • 29. Example: Sparse Index • Most fields of the matrix would be empty • So we store it as a sparse matrix
      he     D1:2 D2:1 D3:1 D4:1 D5:1
      ink    D3:1 D4:1 D5:1
      pink   D4:1 D5:1
      thing  D3:1
      wink   D1:1 D5:1
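The inverted index of slides 27 to 29 can be sketched in a few lines. The five documents are the ones from slide 28; the tokenizer (lowercase, letters only) and the helper names `build_inverted_index` and `candidates` are assumptions made for this illustration.

```python
import re
from collections import Counter, defaultdict

# The five example documents from slide 28, numbered D1..D5.
DOCS = {
    1: "He likes to wink, he likes to drink.",
    2: "He likes to drink, and drink, and drink.",
    3: "The thing he likes to drink is ink.",
    4: "The ink he likes to drink is pink.",
    5: "He likes to wink and drink pink ink.",
}

def tokenize(text):
    """Assumed tokenizer: lowercase the text and keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    """Map each term to a sparse posting list {doc_id: term count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return dict(index)

def candidates(index, query):
    """Documents that contain at least one query term."""
    hits = set()
    for term in tokenize(query):
        hits |= set(index.get(term, {}))
    return hits

INDEX = build_inverted_index(DOCS)
print(INDEX["he"])
print(candidates(INDEX, "pink thing"))
```

Only non-zero counts are stored, so the dictionary of posting lists plays the role of the sparse matrix.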
  • 30. Positional Indices • The inverted index allows us to know which terms are in which documents • But we lose position information • Term proximity may be important Query: “Square dance” Doc 1: “…Times Square in New York is often host to dance performances...” Doc 2: “… Square dance is a dance for four couples…”
  • 31. Example: Positional Index • To preserve position information, we store each term occurrence as a tuple <doc,pos>
      he     D1:1 D1:5 D2:1 D3:3 D4:3 D5:1
      ink    D3:8 D4:2 D5:8
      pink   D4:8 D5:7
      thing  D3:2
      wink   D1:4 D5:4
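A positional index over the same five example documents might look as follows. Positions are counted from 1, as on the slide; the tokenizer and the helper `phrase_docs` (which checks whether one term directly follows another) are assumptions added for illustration.

```python
import re
from collections import defaultdict

# The five example documents from slide 28, numbered D1..D5.
DOCS = {
    1: "He likes to wink, he likes to drink.",
    2: "He likes to drink, and drink, and drink.",
    3: "The thing he likes to drink is ink.",
    4: "The ink he likes to drink is pink.",
    5: "He likes to wink and drink pink ink.",
}

def build_positional_index(docs):
    """Map each term to a list of (doc_id, position) tuples, positions from 1."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        for pos, term in enumerate(tokens, start=1):
            index[term].append((doc_id, pos))
    return dict(index)

def phrase_docs(index, first, second):
    """Documents in which `second` occurs directly after `first`."""
    second_hits = set(index.get(second, []))
    return sorted({doc for doc, pos in index.get(first, [])
                   if (doc, pos + 1) in second_hits})

POSITIONS = build_positional_index(DOCS)
print(POSITIONS["he"])
print(phrase_docs(POSITIONS, "pink", "ink"))
```

Both D4 and D5 contain "pink" and "ink", but only in D5 are the two words adjacent, which is the kind of distinction the "Square dance" example on slide 30 calls for.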
  • 32. What if the index becomes too large?
  • 33. Distributed Indices • If we cannot fit the index on a single machine any more, we split it up • Such distributed architectures introduce new problems – Crawling – Indexing – Retrieval
  • 37. Boolean Retrieval • The early days (1960’s-80’s) • Searching in highly specialized collections – Legal or medical documents – Newspaper archives • Searcher is a trained professional
  • 38. Boolean Queries • Complex queries in expert syntax describe information need (E.g., Westlaw) Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company. Query: "trade secret" /s disclos! /s prevent /s employe! Information need: Requirements for disabled people to be able to access a workplace. Query: disab! /p access! /s work-site work-place (employment /3 place)
  • 39. Boolean Retrieval Summary • Advantages (+): – Intuitive – Gives searcher direct control over the retrieval process • Disadvantages (−): – High cognitive load during query formulation and result set exploration – No ordering to result sets – Requires expert syntax
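At its core, Boolean retrieval reduces to set operations on posting lists. The sketch below uses plain AND / OR / NOT on the five example documents from slide 28; it does not attempt to reproduce Westlaw's proximity operators (`/s`, `/p`, `!`), and the helper name `build_postings` is hypothetical.

```python
import re

# The five example documents from slide 28, numbered D1..D5.
DOCS = {
    1: "He likes to wink, he likes to drink.",
    2: "He likes to drink, and drink, and drink.",
    3: "The thing he likes to drink is ink.",
    4: "The ink he likes to drink is pink.",
    5: "He likes to wink and drink pink ink.",
}

def build_postings(docs):
    """Map each term to the set of documents that contain it."""
    postings = {}
    for doc_id, text in docs.items():
        for term in set(re.findall(r"[a-z]+", text.lower())):
            postings.setdefault(term, set()).add(doc_id)
    return postings

POSTINGS = build_postings(DOCS)

wink_and_ink = POSTINGS["wink"] & POSTINGS["ink"]      # AND: intersection
pink_or_thing = POSTINGS["pink"] | POSTINGS["thing"]   # OR: union
drink_not_ink = POSTINGS["drink"] - POSTINGS["ink"]    # AND NOT: difference

print(wink_and_ink, pink_or_thing, drink_not_ink)
```

The results are unordered sets: a document either matches or it does not, which illustrates the "no ordering to result sets" drawback listed above.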
  • 40. Ranked Retrieval • Introduced in the early 1990’s • Returns documents in their order of relevance • New problem: How to determine relevance? • Retrieval model computes one relevance score per document and sorts accordingly
  • 41. Vector Space Models • Geometric model • Idea: – Relevant documents are similar to the query – The more similar they are, the higher they are ranked
  • 42. Vector Space Mappings • Translate text documents into vectors • Document Text: “IT WAS the best of times, it was the worst of times…”
      Terms:                   it  was  the  best  of  times  worst  cat
      Binary Vector:           1   1    1    1     1   1      1      0
      Frequency-based Vector:  2   2    2    1     2   2      1      0
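The two mappings on this slide can be reproduced with a short sketch. The vocabulary order is taken from the slide; the tokenizer and the function names are assumptions for illustration.

```python
import re
from collections import Counter

# Vocabulary in the order shown on the slide ("cat" does not occur in the text).
VOCABULARY = ["it", "was", "the", "best", "of", "times", "worst", "cat"]
TEXT = "IT WAS the best of times, it was the worst of times"

def frequency_vector(text, vocabulary):
    """One dimension per vocabulary term, holding the term's count in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in vocabulary]

def binary_vector(text, vocabulary):
    """One dimension per vocabulary term: 1 if the term occurs, else 0."""
    return [1 if count > 0 else 0 for count in frequency_vector(text, vocabulary)]

print(binary_vector(TEXT, VOCABULARY))
print(frequency_vector(TEXT, VOCABULARY))
```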
  • 43. Documents in Space • Each vector is a point in a high-dimensional space • We tend to stop drawing at 3 dimensions, but 10,000 are also no problem mathematically ;-) • Recall: Our document index already looks like a vector!
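  • 44. Distances in Vector Spaces • Similar documents lie close together [Figure: the points (1,2), (1,1) and (2,0) in a two-dimensional space with axes “cat” and “dog”; δ = 1 between (1,2) and (1,1), δ = 1.41 between (1,1) and (2,0)]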
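The two distances in the figure are plain Euclidean distances between the plotted points. A minimal sketch, with `euclidean` as a hypothetical helper name:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The three points shown on the slide.
p1, p2, p3 = (1, 2), (1, 1), (2, 0)

print(euclidean(p1, p2))            # the pair labelled δ = 1
print(round(euclidean(p2, p3), 2))  # the pair labelled δ = 1.41
```

The same function works unchanged for vectors with thousands of dimensions, since it only iterates over coordinate pairs.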
  • 45. Probabilistic Models • Alternatively, use probabilistic methods • For each document, compute the probability of relevance towards the query • Rank by probability
  • 46. Building Document Models • Build a language model of each document • Count term frequencies and divide by document length • Example documents: “He likes to wink, he likes to drink.” / “The thing he likes to drink is ink.” / “He likes to wink and drink pink ink.” [Figure: per-document term distributions over the terms is, and, to, the, wink, thing, pink, likes, ink, drink, he]
  • 47. Ranking Documents • Compute the likelihood of the query being generated by the document models • Order documents by decreasing probability
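Slides 46 and 47 together describe query-likelihood ranking, which can be sketched as follows. The three documents are those from slide 46. The sketch deliberately uses the unsmoothed estimate named on the slide (term frequency divided by document length), so a document missing any query term scores zero; practical systems usually apply smoothing, which is omitted here. All function names are hypothetical.

```python
import re
from collections import Counter

# The three example documents from slide 46.
DOCS = {
    1: "He likes to wink, he likes to drink.",
    2: "The thing he likes to drink is ink.",
    3: "He likes to wink and drink pink ink.",
}

def tokenize(text):
    """Assumed tokenizer: lowercase the text and keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def language_model(text):
    """P(term | document) = term frequency / document length (no smoothing)."""
    tokens = tokenize(text)
    return {term: count / len(tokens) for term, count in Counter(tokens).items()}

def query_likelihood(query, model):
    """Probability that the document model generates every query term."""
    score = 1.0
    for term in tokenize(query):
        score *= model.get(term, 0.0)
    return score

def rank(query, docs):
    """Return (doc_id, score) pairs ordered by decreasing likelihood."""
    scored = [(doc_id, query_likelihood(query, language_model(text)))
              for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(rank("likes drink", DOCS))
print(rank("pink ink", DOCS))
```

For the query "likes drink", document 1 ranks first because "likes" makes up a larger share of its eight tokens than in the other two documents.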
  • 48. PageRank • Google’s fabled original ranking criterion • Based on the idea that page authoritativeness should be rewarded during ranking • Authoritativeness scores are iteratively propagated along hyperlinks
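The iterative propagation of authority along hyperlinks can be sketched with a simple power iteration. This is a textbook-style simplification, not Google's production algorithm: the three-page link graph is made up, the damping factor of 0.85 is a commonly used value assumed here, and every link target is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Repeatedly redistribute each page's score along its outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}   # start with a uniform score
    for _ in range(iterations):
        # Every page receives a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: A links to B and C, B links to C, C links to A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
RANKS = pagerank(LINKS)
print(sorted(RANKS, key=RANKS.get, reverse=True))
```

Page C collects links from both A and B and ends up with the highest score, while B, linked only from A, ends up lowest.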
  • 51. Everyone wants your Data. But what do they do with it?
  • 53. Search Personalization • People have very specific preferences and interests (Topics, language, textual complexity, location, etc.) • Based on your previous search history, we can find out about these preferences • For future searches, the engine tries to optimize for the newly found preference
  • 54. Query Suggestion • Query suggestions try to help people formulate their information needs • They are selected on the basis of frequent queries that extend the current query terms
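One simple way to realise "frequent queries that extend the current query terms" is a prefix lookup in a query log. The log below and its counts are invented for illustration, and `suggest` is a hypothetical helper; real engines use far richer signals.

```python
# Hypothetical query log: query string -> number of times it was issued.
QUERY_LOG = {
    "information retrieval": 120,
    "information retrieval book": 45,
    "information retrieval course": 30,
    "information theory": 80,
    "inverted index": 25,
}

def suggest(prefix, log, k=3):
    """Return up to k logged queries that extend the prefix, most frequent first."""
    prefix = prefix.lower().strip()
    matches = [(query, count) for query, count in log.items()
               if query.startswith(prefix) and query != prefix]
    # Sort by descending frequency; break ties alphabetically.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [query for query, _ in matches[:k]]

print(suggest("information retrieval", QUERY_LOG))
print(suggest("info", QUERY_LOG, k=2))
```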
  • 56. Data-driven Spelling Correction • Idea: Consensus is strong / errors are random
      Spelling           Frequency
      albert einstein    4834
      albert einstien    525
      albert einstine    149
      albert einsten     27
      albert einsteins   25
      albert einstain    11
      albert einstin     10
      albert eintein     9
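The "consensus is strong" idea can be sketched by choosing, among logged spellings that resemble the input, the one seen most often. The frequencies are those from the slide; the use of `difflib.SequenceMatcher` as the similarity measure and the 0.8 threshold are assumptions made for this example.

```python
from difflib import SequenceMatcher

# Spelling frequencies as listed on the slide.
SPELLING_COUNTS = {
    "albert einstein": 4834,
    "albert einstien": 525,
    "albert einstine": 149,
    "albert einsten": 27,
    "albert einsteins": 25,
    "albert einstain": 11,
    "albert einstin": 10,
    "albert eintein": 9,
}

def correct(query, counts, min_similarity=0.8):
    """Return the most frequent logged spelling that is similar to the query."""
    candidates = [spelling for spelling in counts
                  if SequenceMatcher(None, query, spelling).ratio() >= min_similarity]
    if not candidates:
        return query   # nothing similar in the log: leave the query unchanged
    return max(candidates, key=counts.get)

print(correct("albert einstien", SPELLING_COUNTS))
print(correct("albert eintein", SPELLING_COUNTS))
```

No dictionary is involved: the correct spelling wins only because many more users typed it than any single misspelling.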
  • 57. Dangers: Welcome to the Filter Bubble • Data-driven methods are very powerful • But what happens to the niche information need? • What if I want to see that video that nobody else likes? • Diversification techniques can help but there is a danger of drowning in the mainstream
  • 58. Section 3 IR FOR CHILDREN
  • 60. The PuppyIR Project • European Union research project on child-friendly information access • Investigate the specific needs of children • Create an open-source framework to cater for these needs • 5 Universities, 3 Business partners
  • 61. The Internet – A Place full of Kids • Children interact with the Internet at a young age (~ 4 years old) • They spend more time online • ~ 40% of British 10-year-olds have regular unsupervised Internet access
  • 62. The Internet – A Place for Kids? • Many media platforms are mirrors of popular culture • Large parts of their content might not be suitable for children • Most IR systems are designed with adult users in mind, not children
  • 63. How to learn what Children Need? • Literature study • Surveys – 300 US parents and teachers • User studies – 49 Dutch elementary school children • Query log analyses – Thousands of young Yahoo users
  • 64. Children’s Deficits • They type / spell badly • They browse rather than search • They are bad at keywording • They struggle with search interfaces • Everything is relevant • Complex content is a challenge
  • 65. Web Page Classification • Can we automatically determine whether a web site is suitable for children? – Based on its content – Based on its link neighborhood
  • 67. Content Simplification • Sometimes we do not want to completely exclude certain documents • But they still contain complex language that is hard to grasp for kids • In these cases, we can automatically offer simplifications for difficult/technical terms
  • 71. Introduction to IR • Manning, Raghavan, Schütze • Available online at: http://nlp.stanford.edu/IR-book/ • All-in-one introduction
  • 72. Modern Information Retrieval • Baeza-Yates, Ribeiro-Neto • Good overview of basic topics
  • 73. Data Mining • Witten, Frank • Data mining and pattern recognition essentials
  • 74. Managing Gigabytes • Witten, Moffat, Bell • What to do when your data gets large?
  • 75. Speech and Language Processing • Jurafsky, Martin • Comprehensive overview of speech and language technology