How search engines work Anand Saini

Helping people find what they’re looking for
 Start with an “information need”
 Convert it to a query
 Get results
In the materials available
 Web pages
 Other formats
 Deep Web

 Search can’t find what’s not there
 The content is hugely important
 Information Architecture is vital
 Usable sites have good navigation and structure

Index ahead of time
• Find files or records
• Open each one and read it
• Store each word in a searchable index
Provide search forms
• Match the query terms with words in the index
• Sort documents by relevance
Display results
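
The index-then-search flow above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions (whitespace tokenization, in-memory documents), not how any particular engine is built; the names `build_index` and `search` and the sample documents are invented for the example.

```python
from collections import defaultdict

def build_index(docs):
    """Index ahead of time: map each lowercase word to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Match query terms against the index and sort documents by a crude relevance score."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1  # relevance here = number of query terms matched
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical sample content.
docs = {
    "a.html": "cruise ship amenities",
    "b.html": "cruise prices and dates",
    "c.html": "company history",
}
index = build_index(docs)
results = search(index, "cruise amenities")
```

Here `a.html` matches both query words and ranks ahead of `b.html`, which matches only one; `c.html` is not retrieved at all.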

Like an iceberg, 2/3 below water
[Figure: iceberg diagram; the user interface is above the waterline, search functionality and content are below]

• Text search works for structured content
• Keyword search vs. SQL queries
• Approximate vs. exact match
• Multiple sources of content
• Response time and database resources
• Relevance ranking, very important
• Works in the real world (e.g. eBay)

Users blame the search engine
 Even when the content is unavailable
Understand the scope of site or intranet
 Kinds of information
 Divided sites: products / corporate info
 Dates
 Languages
 Sources and data silos: CMSs, databases...
 Update processes

Store text to search it later
Many ways to gather text
 Crawl (spider) via HTTP
 Read files on file servers
 Access databases (HTTP or API)
 Data silos via local APIs
 Applications, CMSs, via Web Services
Security and Access Control

 Basic information for document or record
• File name / URL / record ID
• Title or equivalent
• Size, date, MIME type
 Full text of item
 More metadata
• Product name, picture ID
• Category, topic, or subject
• Other attributes, for relevance ranking and display
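
One way to picture what the index stores per item is a small record type. The sketch below is only an illustration: the class name `IndexRecord`, its field names, and the sample values are assumptions, not a schema from any specific search product.

```python
from dataclasses import dataclass, field

@dataclass
class IndexRecord:
    # Basic information for the document or record
    url: str                 # file name, URL, or record ID
    title: str               # title or equivalent
    size_bytes: int
    modified: str            # date as an ISO string, e.g. "2006-05-01"
    mime_type: str
    # Full text of the item
    text: str
    # More metadata, used for relevance ranking and display
    metadata: dict = field(default_factory=dict)

# Hypothetical record for a product page.
record = IndexRecord(
    url="https://example.com/cruises/alaska",
    title="Alaska Cruises",
    size_bytes=18432,
    modified="2006-05-01",
    mime_type="text/html",
    text="Seven-night Alaska cruises depart weekly in summer.",
    metadata={"product_name": "Alaska 7-night", "category": "cruises"},
)
```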

Stop words
Stemming
Metadata
 Explicit (tags)
 Implicit (context)
Semantics
 CMS and Database fields
 XML tags and attributes
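
Stop-word removal and stemming can be illustrated with a deliberately naive sketch. The stop-word list and the suffix rules below are assumptions chosen for brevity; production stemmers (such as the Porter stemmer) use far more careful rules.

```python
import re

# Tiny illustrative stop-word list; real lists are longer and language-specific.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

def naive_stem(word):
    """Strip a couple of common English suffixes, keeping at least three letters."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Lowercase, split into words, drop stop words, then stem what is left."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

terms = index_terms("The sailing dates of the cruises")
```

With these rules, "sailing" and "sail" end up as the same index term, so a query for either can match the other.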

What happens after you click the search button and
before retrieval starts.
Usually in this order
 Handle character set, maybe language
 Look for operators and organize the query
 Look for field names or metadata
 Extract words (just like the indexer)
 Deal with letter casing
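
The query-processing steps above might look roughly like this. It is a sketch under assumptions: the `field:value` syntax, the uppercase `AND`/`OR`/`NOT` operators, and the function name `parse_query` are illustrative choices, not the syntax of any specific engine.

```python
import re
import unicodedata

def parse_query(raw):
    """Split a raw query into phrases, field filters, operators, and plain terms."""
    # Handle character set: normalize Unicode so equivalent characters compare equal.
    raw = unicodedata.normalize("NFKC", raw)
    parsed = {"phrases": [], "fields": {}, "operators": [], "terms": []}
    # Pull out quoted phrases first so their words are not treated as separate terms.
    for phrase in re.findall(r'"([^"]+)"', raw):
        parsed["phrases"].append(phrase.lower())
    rest = re.sub(r'"[^"]+"', " ", raw)
    for token in rest.split():
        if token in ("AND", "OR", "NOT"):        # operators must be uppercase here
            parsed["operators"].append(token)
        elif ":" in token:                        # field name or metadata filter
            name, _, value = token.partition(":")
            parsed["fields"][name.lower()] = value.lower()
        else:
            parsed["terms"].append(token.lower()) # fold case, just like the indexer
    return parsed

parsed = parse_query('title:Alaska Cruise AND "shore excursions"')
```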

• Retrieval: find files with query terms
• Not the same as relevance ranking
Recall: find all relevant items
Precision: find only relevant items
Increasing one decreases the other
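
Recall and precision can be computed for a single query once someone has judged which items are relevant. The sketch below assumes such judgments exist; the function name and document ids are made up.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Four items came back; two of the three truly relevant items are among them.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5"],
)
```

In this example precision is 2/4 and recall is 2/3. Retrieving more items could raise recall by picking up `d5`, but would probably add irrelevant items and lower precision.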

Single-word queries
 Find items containing that word
Multi-word queries: combine lists
 Any: every item with any query word
 All: only items with every word
 Phrases: find only items with all words in order
Boolean and complex queries
– Use algorithm to combine lists
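
Combining the per-word lists maps naturally onto set operations. The sketch below assumes a positional index (word to document to word positions) so that phrases can be checked; the index contents and function names are hypothetical.

```python
# Hypothetical positional index: word -> {doc_id: [positions of the word]}
index = {
    "alaska": {"d1": [0], "d2": [3]},
    "cruise": {"d1": [1], "d2": [0], "d3": [2]},
}

def match_any(index, words):
    """Any: every item containing at least one query word (set union)."""
    result = set()
    for w in words:
        result |= set(index.get(w, {}))
    return result

def match_all(index, words):
    """All: only items containing every query word (set intersection)."""
    lists = [set(index.get(w, {})) for w in words]
    return set.intersection(*lists) if lists else set()

def match_phrase(index, words):
    """Phrase: items where the words appear next to each other, in order."""
    hits = set()
    for doc_id in match_all(index, words):
        for start in index[words[0]][doc_id]:
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits

query = ["alaska", "cruise"]
any_hits = match_any(index, query)
all_hits = match_all(index, query)
phrase_hits = match_phrase(index, query)
```

Each step is stricter than the last: three documents match any word, two match all words, and only one contains the exact phrase.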

• Empty search
• Nothing on the site on that topic (scope)
• Misspelling or typing mistakes
• Vocabulary differences
• Restrictive search defaults
• Restrictive search choices
• Software failure

Theory: sort the matching items, so the most
relevant ones appear first
Can't really know what the user wants
Relevance is hard to define and situational
Short queries tend to be deeply ambiguous
What do people mean when they type “bank”?
First 10 results are the most important
The more transparent, the better

 Sorting documents on various criteria
 Start with words matching query terms
 Citation and link analysis
 Like old library Citation Indexes
 Ted Nelson - not only hypertext, but the links
 Google PageRank
 Incoming links
 Authority of linkers
 Taxonomies and external metadata
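
The idea that incoming links from authoritative pages raise a page's rank can be shown with a simplified PageRank-style iteration. This is a textbook-style sketch, not Google's production algorithm; the damping factor of 0.85, the iteration count, and the tiny link graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page with no outgoing links: spread its rank over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical site: everything links to "home"; nothing links to "blog".
links = {
    "home": ["products", "about"],
    "products": ["home"],
    "about": ["home"],
    "blog": ["home"],
}
ranks = pagerank(links)
```

In this graph "home" ends up with the highest score because every other page links to it, and "blog" the lowest because it has no incoming links.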

• Term frequency in the item
• Inverse document frequency of term
 Rare words are likely to be more important
wij = weight of Term Tj in Document Di
tfij = frequency of Term Tj in Document Di
N = number of Documents in collection
n = number of Documents where Term Tj occurs at least once

From Salton 1989
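
These definitions fit the standard tf-idf weighting, w_ij = tf_ij × log(N / n). The sketch below assumes that form and a natural logarithm (a different base only rescales the weights); the function name and sample documents are invented.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return {doc_id: {term: weight}} using w_ij = tf_ij * log(N / n)."""
    N = len(docs)  # number of documents in the collection
    term_counts = {
        doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()
    }
    # n for each term: number of documents where it occurs at least once.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    return {
        doc_id: {t: tf * math.log(N / doc_freq[t]) for t, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }

docs = {
    "d1": "bank river bank",
    "d2": "bank loan",
    "d3": "bank account loan",
}
weights = tfidf_weights(docs)
```

Because "bank" occurs in every document, its weight is zero everywhere, while the rarer "river" gets the largest weight in `d1`. That is the sense in which rare words are treated as more important.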

• Vector space
• Probabilistic (binary independence)
• Fuzzy set theory
• Bayesian statistical analysis
• Latent semantic indexing
• Neural networks
• Machine learning
• All require sophisticated queries
• See Modern Information Retrieval (MIR), chapter 2

Heuristics are rules of thumb
• Not algorithms, not math
Search Relevance Ranking Heuristics
• Documents containing all search words
• Search words as a phrase
• Matches in title tag
• Matches in other metadata
Based on real-world user behavior
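
Heuristic ranking of this kind amounts to adding up points for each rule that fires. In the sketch below the point values (2.0, 3.0, 1.5) are arbitrary assumptions picked for illustration; in practice they are tuned by observing user behavior. Other metadata is left out to keep the example short.

```python
def heuristic_score(query, title, body):
    """Score one document against a query using simple rules of thumb."""
    words = query.lower().split()
    if not words:
        return 0.0
    title_words = set(title.lower().split())
    body_lower = body.lower()
    body_words = set(body_lower.split())
    score = 0.0
    if all(w in body_words for w in words):
        score += 2.0                      # document contains all search words
    if " ".join(words) in body_lower:
        score += 3.0                      # search words appear as a phrase
    score += 1.5 * sum(w in title_words for w in words)  # matches in the title
    return score

a = heuristic_score("alaska cruise", "Alaska Cruise Deals", "Book an Alaska cruise today")
b = heuristic_score("alaska cruise", "Cruise FAQ", "Our cruise ships visit Alaska in summer")
```

Both documents contain both words, but the first also has the exact phrase and both words in its title, so it scores higher.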

What users see after they click the Search button
The most visible part of search
Elements of the results page
 Page layout and navigation
 Results header
 List of results items
 Results footer

Human judgment beats algorithms
Great for frequent, ambiguous searches
 Use search log to identify best candidates
Recommend good starting pages
 Product information, FAQs, etc.
Requires human resources
 That means money and time
More static than algorithmic search

 Leverage content structure
 database fields (e.g. cruise amenities)
 document metadata (news article bylines)
 Provide both search and browse
 Support information foraging
 Integrate navigation with results
 Not just subject taxonomies
 Display only fruitful paths, no dead ends
 Supported by academic research
 Marti Hearst, UCB SIMS, flamenco.berkeley.edu
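
One way to integrate navigation with results is to compute facet counts from the current result set, so that only values that actually occur are offered. The sketch below is an illustration with invented field names and records, not the Flamenco implementation.

```python
from collections import Counter

def facet_counts(results, field):
    """Count the values of one metadata field across the current results."""
    return Counter(item[field] for item in results if field in item)

def narrow(results, field, value):
    """Follow a facet link: keep only results with the chosen value."""
    return [item for item in results if item.get(field) == value]

# Hypothetical search results with metadata from database fields.
results = [
    {"title": "Alaska 7-night", "region": "Alaska", "amenity": "spa"},
    {"title": "Alaska 10-night", "region": "Alaska", "amenity": "pool"},
    {"title": "Caribbean 5-night", "region": "Caribbean", "amenity": "pool"},
]
regions = facet_counts(results, "region")
alaska_only = narrow(results, "region", "Alaska")
amenities_in_alaska = facet_counts(alaska_only, "amenity")
```

Because the counts come from the results themselves, every facet value shown leads to at least one item, so there are no dead ends.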

Metrics
 Number of searches
 Number of no-matches searches
 Traffic from search to high-value pages
 Relate search changes to other metrics
Search Log Analysis
 Top 5% searches: phrases and words
 Top no-matches searches
 Use as market research
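
A first pass at search log analysis needs little more than counting. The sketch below assumes each log entry is a (query, number of results) pair; real log formats vary, and the sample entries are invented.

```python
from collections import Counter

# Hypothetical log entries: (query as typed, number of results returned)
log = [
    ("cruise", 12), ("Cruise", 12), ("cruise ", 12), ("alaska", 4),
    ("parking", 0), ("parking", 0), ("visa", 0),
]

def normalize(query):
    """Fold case and extra spaces so trivially different queries count together."""
    return " ".join(query.lower().split())

top_queries = Counter(normalize(q) for q, _ in log)
no_match = Counter(normalize(q) for q, hits in log if hits == 0)
no_match_rate = sum(no_match.values()) / len(log)

most_frequent = top_queries.most_common(1)
top_no_match = no_match.most_common(1)
```

The top no-match queries show what people expect to find but cannot, which is where the market research value comes from.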

Search engines can’t read minds
 User queries are short and ambiguous
Some things will help
 Design a usable interface
 Show match words in context
 Keep index current and complete
 Adjust heuristic weighting
 Maintain suggestions and synonyms
 Consider faceted metadata search
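
Showing match words in context usually means cutting a short excerpt around the first hit and marking the matched word. A minimal sketch; the `**` markers and the default width are arbitrary choices for the example, and a real results page would use HTML highlighting.

```python
import re

def snippet(text, term, width=30):
    """Return an excerpt around the first match of term, with the match marked."""
    match = re.search(re.escape(term), text, flags=re.IGNORECASE)
    if not match:
        return text[: 2 * width]          # no match: fall back to the opening text
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    return (
        text[start:match.start()]
        + "**" + match.group(0) + "**"    # keep the original casing of the match
        + text[match.end():end]
    )

excerpt = snippet("Our ships visit Alaska every summer.", "alaska", width=10)
```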

Join us
Address: WZ-30-a, Bhagwan Das Nagar
East Punjabi Bagh, Delhi-110026
Tel.: 011 28316148, 3203571, 30538061
Mobile: +91-8010 298 388, 8010 198 388
E-mail: info@seocertification.org.in
