2. What is a Search Engine
A search engine is a software program that
searches for sites based on the words that you
designate as search terms.
"Search engine" is the popular term for an
Information Retrieval (IR) system.
3. Motto of search engines
A web search engine is designed to search for
information on the World Wide Web and
FTP servers. The search results are generally
presented in a list of results often referred to
as SERPs, or "search engine results pages".
The information may consist of web pages,
images, and other types of files.
4. Purpose of Search Engines
Helping people find what they’re looking
for
• Start with an "information need"
• Convert it to a query
• Get results
In the materials available
• Web pages
• Other formats
• Deep Web
5. HISTORY
Archie – First search tool for the Internet
Gopher – indexed plain text documents
Jughead – searched the files stored in
Gopher index systems
Wandex – First Web search engine
6. How web search engines work
A search engine operates in the following
order:
Web Crawling
Indexing
Searching
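The three stages above can be sketched end to end on a toy scale. This is only an illustration under stated assumptions: the `PAGES` dictionary is a hypothetical in-memory stand-in for the Web (no real HTTP requests are made), and the function names `crawl`, `build_index`, and `search` are invented for this example.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.example/home": ("welcome to the search engine tutorial",
                       ["a.example/crawl", "a.example/index"]),
    "a.example/crawl": ("a crawler follows links between pages",
                        ["a.example/index"]),
    "a.example/index": ("an index maps each word to pages",
                        ["a.example/home"]),
}

def crawl(seed):
    """Web crawling: breadth-first traversal from the seed URL."""
    seen, queue, fetched = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        text, links = PAGES[url]
        fetched[url] = text
        for link in links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched

def build_index(documents):
    """Indexing: build an inverted index, word -> set of URLs."""
    index = {}
    for url, text in documents.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, word):
    """Searching: look the word up in the index."""
    return sorted(index.get(word.lower(), set()))

index = build_index(crawl("a.example/home"))
print(search(index, "pages"))
```

A real crawler would also need politeness rules, duplicate detection, and error handling, none of which are shown here.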
8. Search is Not a Panacea
Search can’t find what’s not there
• The content is hugely important
Information Architecture is vital
Usable sites have good navigation and
structure
9. Search Engine Modules
A query processor
A search and matching function
A ranking capability
A document summarizing and presentation function
10. How Search Engines Worked in
Earlier Days
From 1990 to 1998 (first generation of search
tools):
• Looked at title of web pages
• Ranking was based on page content
• Looked at number of times the search term
appeared on the page
• Looked at metatags
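A first-generation ranking of this kind might look roughly like the sketch below. The function name `first_gen_score` and the weights are arbitrary assumptions chosen for illustration; they do not describe any specific historical engine.

```python
def first_gen_score(term, title, body, meta_keywords,
                    title_weight=3, meta_weight=2):
    """Score a page purely from its own content (weights are assumed)."""
    term = term.lower()
    title_hits = title.lower().split().count(term)
    body_hits = body.lower().split().count(term)
    meta_hits = [k.lower() for k in meta_keywords].count(term)
    return title_weight * title_hits + body_hits + meta_weight * meta_hits

pages = [
    {"title": "Cooking tips",
     "body": "a cricket match was on while we cooked",
     "meta": ["food"]},
    {"title": "Cricket news",
     "body": "cricket scores and cricket fixtures",
     "meta": ["cricket", "sport"]},
]

ranked = sorted(
    pages,
    key=lambda p: first_gen_score("cricket", p["title"], p["body"], p["meta"]),
    reverse=True,
)
print([p["title"] for p in ranked])
```

Because every signal comes from the page itself, a page author can raise the score simply by repeating the term, which is one reason later engines added link-based signals.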
11. SEO (Search Engine Optimization)
Used by companies to rank higher in
search engine results
White hat: Using legitimate techniques
Black hat: Using deceptive techniques that violate
search engine guidelines to trick the search engine,
like paying sites to link to you.
13. Search is Only as Good as the Content
Users blame the search engine
• Even when the content is unavailable
Understand the scope of site or intranet
• Kinds of information
• Divided sites: products / corporate info
• Dates
• Languages
• Sources and data silos: databases...
• Update processes
14. Making a Searchable Index
Store text to search it later
Many ways to gather text
• Crawl (spider) via HTTP
• Read files on file servers
• Access databases (HTTP or API)
• Data silos via local APIs
• Applications, CMSs, via Web Services
Security and Access Control
16. What the Index Needs
Basic information for document or record
• File name / URL / record ID
• Title or equivalent
• Size, date, MIME type
Full text of item
More metadata
• Product name, picture ID
• Category, topic, or subject
• Other attributes, for relevance ranking and display
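One way to picture the fields listed above is as a single record type. The sketch below is an assumption for illustration only: the class name `IndexRecord`, the field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class IndexRecord:
    # Basic information for the document or record
    record_id: str      # file name, URL, or record ID
    title: str          # title or equivalent
    size_bytes: int
    modified: date
    mime_type: str
    # Full text of the item
    full_text: str
    # More metadata: product name, category, other ranking/display attributes
    metadata: dict = field(default_factory=dict)

record = IndexRecord(
    record_id="https://example.com/products/widget",  # hypothetical URL
    title="Widget",
    size_bytes=2048,
    modified=date(2020, 1, 15),
    mime_type="text/html",
    full_text="The widget is a small hypothetical product.",
    metadata={"product_name": "Widget", "category": "hardware"},
)
print(record.title, record.metadata["category"])
```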
19. Search Query Processing
What happens after you click the search
button, and before retrieval starts.
Usually in this order
• Handle character set, maybe language
• Look for operators and organize the query
• Look for field names or metadata
• Extract words (just like the indexer)
• Deal with letter casing
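The steps above could be sketched as a small parsing function. The query syntax assumed here (uppercase `AND`/`OR`/`NOT`, double-quoted phrases, `field:value` filters) is a common convention and not one taken from any particular engine; `process_query` is a hypothetical name.

```python
import re
import unicodedata

OPERATORS = {"AND", "OR", "NOT"}  # assumed operator keywords

def process_query(raw):
    """Split a raw query into phrases, field filters, operators, and terms."""
    # Handle character set: normalize equivalent Unicode forms.
    text = unicodedata.normalize("NFKC", raw)
    # Look for operators: quoted phrases first, then keyword operators.
    phrases = [p.lower() for p in re.findall(r'"([^"]+)"', text)]
    text = re.sub(r'"[^"]+"', " ", text)
    fields, operators, terms = {}, [], []
    for token in text.split():
        if token in OPERATORS:
            operators.append(token)
        elif ":" in token and not token.startswith(":"):
            # Look for field names or metadata.
            name, _, value = token.partition(":")
            fields[name.lower()] = value.lower()
        else:
            # Extract words and deal with letter casing.
            terms.extend(re.findall(r"\w+", token.lower()))
    return {"phrases": phrases, "fields": fields,
            "operators": operators, "terms": terms}

parsed = process_query('title:Engine "web crawler" Index AND Ranking')
print(parsed)
```

In a real system the word extraction should use exactly the same tokenizer as the indexer, otherwise query terms and indexed terms will not line up.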
20. Search and Retrieval
Retrieval: find files with query terms
Not the same as relevance ranking
Recall: find all relevant items
Precision: find only relevant items
Increasing one usually decreases the other
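Recall and precision can be computed from a result set and a set of known relevant items. The sketch below uses made-up document IDs; `precision_recall` is a hypothetical helper name.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"d1", "d2", "d3", "d4"}

# A narrow search returns few items, all of them relevant.
narrow = precision_recall(["d1", "d2"], relevant)

# A broad search finds every relevant item but also a lot of noise.
broad = precision_recall(
    ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"], relevant
)

print(narrow)  # (1.0, 0.5)
print(broad)   # (0.5, 1.0)
```

The narrow search has perfect precision but misses half the relevant items; the broad search has perfect recall but half of its results are irrelevant.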
21. Retrieval = Matching
Single-word queries
• Find items containing that word
Multi-word queries: combine lists
• Any: every item with any query word
• All: only items with every word
• Phrases: find only items with all words in
order
Boolean and complex queries
• Use algorithm to combine lists
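The matching modes above can be illustrated with a small positional inverted index. The documents and function names below are assumptions made up for this sketch.

```python
DOCS = {
    1: "new york city guide",
    2: "york is a city in england",
    3: "a new guide to england",
}

def build_positions(docs):
    """Positional inverted index: word -> {doc_id: [positions]}."""
    index = {}
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index

INDEX = build_positions(DOCS)

def match_any(words):
    """Any: union of the posting lists."""
    result = set()
    for w in words:
        result |= set(INDEX.get(w, {}))
    return result

def match_all(words):
    """All: intersection of the posting lists."""
    lists = [set(INDEX.get(w, {})) for w in words]
    return set.intersection(*lists) if lists else set()

def match_phrase(words):
    """Phrase: all words present, at consecutive positions."""
    hits = set()
    for doc_id in match_all(words):
        starts = INDEX[words[0]][doc_id]
        if any(all(s + i in INDEX[w][doc_id] for i, w in enumerate(words))
               for s in starts):
            hits.add(doc_id)
    return hits

print(match_any(["new", "york"]))     # {1, 2, 3}
print(match_all(["new", "york"]))     # {1}
print(match_phrase(["york", "city"])) # {1}
# Boolean example: city AND NOT england
print(match_all(["city"]) - match_any(["england"]))  # {1}
```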
22. Why Searches Fail
Empty search
Nothing on the site on that topic (scope)
Misspelling or typing mistakes
Vocabulary differences
Restrictive search defaults
Restrictive search choices
Software failure
23. Relevance Ranking
Theory: sort the matching items, so the most
relevant ones appear first
Can't really know what the user wants
Relevance is hard to define and situational
Short queries tend to be deeply ambiguous
• What do people mean when they type “bank”?
First 10 results are the most important
24. Relevance Processing
Sorting documents on various criteria
Start with words matching query terms
Citation and link analysis
• Like old library Citation Indexes
• Not only hypertext, but the links
• Google PageRank
• Incoming links
• Authority of linkers
Taxonomies and external metadata
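The link-analysis idea (incoming links, weighted by the authority of the linking pages) can be sketched with a simplified power iteration in the spirit of PageRank. This is a textbook-style approximation, not Google's production algorithm; the link graph, the damping value 0.85, and the iteration count are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}; every target must be a key."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its score evenly.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: A, B, and D all link to C; C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

Page C ranks first because it has the most incoming links. Page A ranks above B and D even though each has at most one incoming link, because A's single link comes from the high-scoring page C.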
25. Search Results Interface
What users see after they click the Search
button
The most visible part of search
Elements of the results page
• Page layout and navigation
• Results header
• List of results items
• Results footer
26. Search Suggestions
Human judgment beats algorithms
Great for frequent, ambiguous searches
• Use search log to identify best candidates
Recommend good starting pages
• Product information, FAQs, etc.
Requires human resources
• That means money and time
More static than algorithmic search
27. Search Metrics
Number of searches
Number of no-matches searches
Traffic from search to high-value pages
Relate search changes to other metrics
28. Query Example
Consider the query "Mahendra Singh Dhoni".
A good answer contains all three words, and the more
frequently they occur the better. We call this Term Frequency (TF).
Some query terms are more important than others because they
have better discriminating power.
For example, an answer containing only "Dhoni" is likely to
be better than an answer containing only "Mahendra".
We call this Inverse Document Frequency (IDF).
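The TF and IDF ideas can be combined into a simple score. The sketch below assumes a tiny made-up corpus and one common formulation (raw term count multiplied by log(N / document frequency)); real engines use many variants of this weighting.

```python
import math

# Hypothetical four-document corpus.
DOCS = {
    "d1": "mahendra singh dhoni led india to the world cup",
    "d2": "dhoni scored a century",
    "d3": "mahendra is a common first name",
    "d4": "singh and mahendra met at the station",
}

def tokenize(text):
    return text.lower().split()

def idf(term):
    """Rarer terms get a higher weight: log(N / documents containing term)."""
    containing = sum(1 for text in DOCS.values() if term in tokenize(text))
    return math.log(len(DOCS) / containing) if containing else 0.0

def tf_idf_score(query, doc_id):
    """Sum over query terms of (count in document) * idf(term)."""
    words = tokenize(DOCS[doc_id])
    return sum(words.count(term) * idf(term) for term in tokenize(query))

query = "Mahendra Singh Dhoni"
ranking = sorted(DOCS, key=lambda d: tf_idf_score(query, d), reverse=True)
print(ranking)  # ['d1', 'd4', 'd2', 'd3']
```

In this corpus "mahendra" appears in three documents and "dhoni" in only two, so "dhoni" has the higher IDF, and the document containing only "dhoni" (d2) outranks the one containing only "mahendra" (d3).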
29. Search Will Never Be Perfect
Search engines can’t read minds
• User queries are short and ambiguous
Some things will help
• Design a usable interface
• Show match words in context
• Keep index current and complete
• Adjust heuristic weighting
• Maintain suggestions and synonyms
• Consider faceted metadata search