Information Retrieval (IR) is about how we search for and retrieve information. In this talk, we look at the components that make up a typical search engine and discuss the challenges associated with each.
2. Disclaimer
Most examples and discussions in this talk revolve around well-known search engines. This is only to make the learning concrete. Please keep in mind that IR goes beyond search engines.
25+ slides of interesting discussion ahead…
2/2014 Venkatesh Vinayakarao 2
3. Quiz
1. Explain any two challenges in Query Intent Understanding using examples, and discuss why it is a hard problem.
2. How are “Tiles”, as discussed in class, used in search engines? What purpose do they serve?
3. Search engines have no UI-related design concerns. True/False?
4. About Me
BE Computer Science (Y2K)
MS (IT)
IT Service Industry
Start Up
Nokia
Yahoo
Microsoft (Bing)
PhD: “Let me learn everything all over again!”
5. Our Agenda: The Beauty of IR!
Offline Horror!: Crawling, Content Processing, Indexing
Online Terror!: Query (Intent) Understanding, Ranking, User Interface
Me!: “How to process Korean queries for local listings?”
6. Crawling
How frequently should we crawl?
Fresh & Super-Fresh! How do we crawl cricket scores? Are we even crawling here?
How do we avoid 404 (Page not found) results?
How long did it take Google to show your first personal page?
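One way to think about the "how frequently" question is an adaptive recrawl interval: crawl a page more often when it keeps changing, and less often when it does not. The sketch below is only an illustration of that heuristic, not a description of what any particular search engine does; the function name, the halving/doubling rule, and the bounds are all assumptions.

```python
def next_crawl_interval(current_hours, page_changed,
                        min_hours=0.25, max_hours=720.0):
    """Return the next recrawl interval in hours (hypothetical heuristic).

    If the page changed since the last crawl, halve the interval so that
    fast-moving pages (e.g. live scores) are revisited sooner. If it did
    not change, double the interval to save crawl budget. The result is
    clamped to [min_hours, max_hours]; both bounds are assumed values.
    """
    if page_changed:
        proposed = current_hours / 2
    else:
        proposed = current_hours * 2
    return max(min_hours, min(max_hours, proposed))
```

A page crawled daily that has changed would next be visited after 12 hours; one that has not changed would wait 48 hours. Content that updates every few seconds, such as live scores, may be better served by a direct feed than by crawling at all.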
8. Content Processing
Query: “Schools in Delhi”
Answer: “Delhi Public School”
Good or Bad?
Query: “Schools in Hyderabad”
Answer: “Delhi Public School”
Good or Bad?
Query: “Hotels in Bombay”
Answer: “Grand Hyatt, Mumbai”
Good or Bad?
How do we get the same results for both Mumbai and Bombay?
Query: “Maruti Car service in delhi”
Answer: “Rana Motors Private Limited”.
What happened?
9. Content Processing & Indexing
A real example:
http://www.yelp.com/dataset_challenge/
Enriched Business
• Category synonyms (e.g., "auto service" and "car service" are interchangeable at times)
• Users' query forms (e.g., McDonalds is commonly queried as McD)
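The enrichment ideas above (category synonyms, common query forms, and city aliases such as Bombay/Mumbai) can be sketched as a normalization step applied to both queries and business records before indexing and matching. This is a minimal illustration; the alias tables and the `normalize` function are hypothetical, and a real system would likely mine such mappings from query logs and curated data rather than hard-code them.

```python
# Hypothetical alias tables, hard-coded only for illustration.
CITY_ALIASES = {"bombay": "mumbai"}
BRAND_ALIASES = {"mcd": "mcdonalds"}
CATEGORY_SYNONYMS = {"auto service": "car service"}


def normalize(text: str) -> str:
    """Lower-case the text and map known aliases to one canonical form."""
    # Collapse whitespace so multi-word phrases match reliably.
    text = " ".join(text.lower().split())
    # Replace multi-word category phrases first, then single tokens.
    for phrase, canonical in CATEGORY_SYNONYMS.items():
        text = text.replace(phrase, canonical)
    tokens = []
    for token in text.split():
        token = CITY_ALIASES.get(token, token)
        token = BRAND_ALIASES.get(token, token)
        tokens.append(token)
    return " ".join(tokens)
```

With this in place, "Hotels in Bombay" and "hotels in Mumbai" normalize to the same string, so they would retrieve the same entries from an index built over normalized text.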
10. Derived Values & Indexing
Given a location, how will you find all businesses within a 1 km radius?
Query: schools near govindpuri delhi
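As a baseline for the radius question, here is a brute-force sketch that computes the great-circle (haversine) distance from the query location to every business and keeps those within the radius. The function names and the `(name, lat, lon)` record layout are assumptions for illustration. A production index would more likely store a derived value per business (for example a grid cell or geohash) so that only nearby cells need to be scanned.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(businesses, lat, lon, radius_km=1.0):
    """Return names of businesses within radius_km of (lat, lon).

    `businesses` is an iterable of (name, lat, lon) tuples; this layout
    is a hypothetical one chosen for the example.
    """
    return [name for name, blat, blon in businesses
            if haversine_km(lat, lon, blat, blon) <= radius_km]
```

The linear scan costs one distance computation per business, which is exactly why a derived, indexable location value is attractive at scale.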
12. Rules
I will give an entity name.
You have to frame at least three different (dissimilar) queries, and as many more as you can, that return the same document as the correct result in first place.
At the end, you should submit: each query, and the maximum number n of top correct results that you kept the same across queries.
You will have 5 minutes.
13. Questions
Tom Cruise
Aishwarya Rai
Tom Hanks
Srikanta Bedathur
Venkatesh Vinayakarao
Pankaj Jalote
Amir Khan
Andre Agassi
Manmohan Singh
14. Query Understanding
Query: Michael Jordan
Which MJ should we return: the basketball player or the actor?
Factors
User profile
Query context (session details, browser data, links, etc.)
…
Query: Delhi School
What does the user want? “Delhi Public School” or “Schools in Delhi” or “some Indian school in US”?
Query: “IR”
Predict top three results
15. Ok! I give up!!
A frustrated search user: “please show me some t-shirt brands”
16. More fun with auto completion
17. System Overview (Simplified)
Diagram components:
Front-ends (multiple), receiving the User Query
Query Understanding and Query Classifiers, producing the Expanded Query
Answers: Web Answer, Local Answer, Finance Answer, Tech Answer & many more
KB and Index Serve
Crawler, which turns the Web into Crawled Content
18. Ranking & Relevance
How do we know if a document is relevant (in the web search context)?
Popularity of the URL
Domain score (is it .ac.in or .edu?)
TF, IDF
Entity, Chain entity?
Trust Factor (Wikipedia?)
Inlinks/Outlinks
Position of query terms
Sequence of query terms
… and 1000s of such things
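Of the signals listed above, TF and IDF are the easiest to show in a few lines. The sketch below scores each document as the sum of tf × idf over the query terms, using raw term counts and idf = ln(N / df). This is one common textbook variant chosen as an assumption here; real rankers combine many more signals and usually use smoothed or normalized weights. The function name and whitespace tokenization are illustrative only.

```python
import math
from collections import Counter


def tf_idf_scores(query, docs):
    """Score each document as the sum of tf * idf over the query terms."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        score = 0.0
        for term in query.lower().split():
            if df[term]:  # skip terms that occur in no document
                score += tf[term] * math.log(n_docs / df[term])
        scores.append(score)
    return scores
```

For the query "delhi school" over the documents "delhi public school", "schools in delhi" and "hotels in mumbai", the first document scores highest because it matches both terms, and the rarer term "school" contributes more than "delhi". Note that "schools" does not match "school" without stemming, which is one of the content-processing problems raised earlier.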
19. Are current search engines good at relevance & ranking?
Bing Google
Query1: Vegetarian hotels in south delhi
Query2: South Indian hotels in south delhi
20. …More examples
Query3: South Indian restaurants in south delhi
What’s the difference between Query2 and Query3? Should search engines give different results?
21. How far for a coffee?
Google: Just one word (iiitd) is missing. So what?
Let’s change the query to “coffee shops near iiitd delhi”.
“Coffee shops near me” gives results from Janakpuri, Gurgaon, CP & Kamla Nagar.
22. Why is it hard?
What makes Ranking & Relevance hard?
23. User Interface
Is UI important for a search engine?
Maps in local results
Live sport score cards
Finance tickers
Filters
Search Operators
Entity Infoboxes
What impact do these have?
24. Our Agenda: The Beauty of IR!
Offline Horror!: Crawling, Content Processing, Indexing
Online Terror!: Query (Intent) Understanding, Ranking, User Interface
Me!: “How to process Korean queries for local listings?”
25. Evaluation
Various evaluation methods
Precision/Recall
Mean Average Precision (MAP)
Mean Reciprocal Rank (MRR)
    If the first relevant doc is at the kth position, RR = 1/k.
NDCG
    Non-Boolean/Graded relevance scores
    $\mathrm{DCG} = r_1 + \frac{r_2}{\log_2 2} + \frac{r_3}{\log_2 3} + \dots + \frac{r_n}{\log_2 n}$
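The set-based and rank-based measures in this list can be sketched in a few lines. The code below assumes each result list is a sequence of document ids and each judgment is a set of relevant ids; the function names and that data layout are illustrative assumptions, not part of the talk.

```python
def precision_recall(ranked, relevant):
    """Precision and recall of a result list against a set of relevant ids."""
    hits = sum(1 for doc in ranked if doc in relevant)
    return hits / len(ranked), hits / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1/k where k is the position of the first relevant doc, else 0."""
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / k
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

For example, if the first relevant document appears at position 3 the reciprocal rank is 1/3, and averaging a query with RR = 1 and a query with RR = 1/2 gives MRR = 0.75.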
26. NDCG - Example
     Ground Truth        Ranking Function 1    Ranking Function 2
i    Doc. order   r_i    Doc. order   r_i      Doc. order   r_i
1    d4           2      d3           2        d3           2
2    d3           2      d4           2        d2           1
3    d2           1      d2           1        d4           2
4    d1           0      d1           0        d1           0

NDCG_GT = 1.00    NDCG_RF1 = 1.00    NDCG_RF2 = 0.9203
4 documents: d1, d2, d3, d4
Taken from http://www.stanford.edu/class/cs276/handouts/EvaluationNew.ppt
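The numbers in this example can be checked with a short sketch. It uses the DCG form given on the Evaluation slide (the first position is undiscounted, position i ≥ 2 is divided by log2 i) and normalizes by the DCG of the ideal ordering. The function and variable names are illustrative.

```python
import math


def dcg(relevances):
    """DCG = r_1 + r_2/log2(2) + r_3/log2(3) + ... + r_n/log2(n)."""
    return sum(r if i == 1 else r / math.log2(i)
               for i, r in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Relevance grades r_i in ranked order, read from the example above.
ground_truth = [2, 2, 1, 0]        # d4, d3, d2, d1
ranking_function_1 = [2, 2, 1, 0]  # d3, d4, d2, d1
ranking_function_2 = [2, 1, 2, 0]  # d3, d2, d4, d1

print(round(ndcg(ranking_function_1), 4))  # 1.0
print(round(ndcg(ranking_function_2), 4))  # 0.9203
```

Ranking Function 1 swaps d3 and d4, which have the same grade, so its DCG equals the ideal. Ranking Function 2 places the grade-1 document above a grade-2 document, which lowers its DCG from about 4.6309 to about 4.2619.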