2. Helping people find what they’re looking for
Start with an “information need”
Convert it to a query
Get results
In the materials available
Web pages
Other formats
Deep Web
3. Search can’t find what’s not there
The content is hugely important
Information Architecture is vital
Usable sites have good navigation and structure
5. Index ahead of time
• Find files or records
• Open each one and read it
• Store each word in a searchable index
Provide search forms
• Match the query terms with words in the index
• Sort documents by relevance
Display results
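The index-then-search flow above can be sketched in a few lines. This is a minimal illustration, not how any particular engine works: the function names (`tokenize`, `build_index`, `search`) and the sample documents are assumptions, and relevance sorting is left out.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents that contain every query word."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample content.
docs = {
    "a.html": "Cruise ships and cruise amenities",
    "b.html": "Bank accounts and river banks",
    "c.html": "Cruise along the river",
}
index = build_index(docs)
print(sorted(search(index, "cruise river")))
```

The work of opening and reading each item happens once, at index time; a query only touches the word lists.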
8. • Text search works for structured content
• Keyword search vs. SQL queries
• Approximate vs. exact match
• Multiple sources of content
• Response time and database resources
• Relevance ranking, very important
• Works in the real world (e.g. eBay)
9. Users blame the search engine
Even when the content is unavailable
Understand the scope of site or intranet
Kinds of information
Divided sites: products / corporate info
Dates
Languages
Sources and data silos: CMSs, databases...
Update processes
10. Store text to search it later
Many ways to gather text
Crawl (spider) via HTTP
Read files on file servers
Access databases (HTTP or API)
Data silos via local APIs
Applications, CMSs, via Web Services
Security and Access Control
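Of the gathering methods above, reading files from a file system is the simplest to sketch. The function name `gather_text_files` and the extension list are assumptions for illustration; a real gatherer would also need format converters and access-control checks.

```python
import os
import tempfile

def gather_text_files(root, extensions=(".txt", ".html")):
    """Yield (path, text) for each file under root with a listed extension."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(extensions):
                path = os.path.join(dirpath, name)
                # Replace undecodable bytes rather than abort the whole run.
                with open(path, encoding="utf-8", errors="replace") as handle:
                    yield path, handle.read()

# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "page.html"), "w", encoding="utf-8") as f:
        f.write("<p>Cruise deals</p>")
    with open(os.path.join(root, "logo.png"), "wb") as f:
        f.write(b"\x89PNG")
    gathered = [(os.path.basename(p), t) for p, t in gather_text_files(root)]

print(gathered)
```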
12. Basic information for document or record
• File name / URL / record ID
• Title or equivalent
• Size, date, MIME type
Full text of item
More metadata
• Product name, picture ID
• Category, topic, or subject
• Other attributes, for relevance ranking and display
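One way to picture the stored record is as a small data class. The class name `IndexRecord`, its field names, and the sample values are all hypothetical; real engines define their own schemas.

```python
from dataclasses import dataclass, field

@dataclass
class IndexRecord:
    doc_id: str        # file name, URL, or record ID
    title: str         # title or equivalent
    size_bytes: int
    modified: str      # date as an ISO string
    mime_type: str
    full_text: str     # full text of the item
    metadata: dict = field(default_factory=dict)  # category, product name, etc.

# Hypothetical example record.
record = IndexRecord(
    doc_id="https://example.com/cruises/alaska.html",
    title="Alaska Cruises",
    size_bytes=18432,
    modified="2024-01-31",
    mime_type="text/html",
    full_text="Seven-night Alaska cruises with shore excursions.",
    metadata={"category": "cruises", "product_name": "Alaska 7-Night"},
)
print(record.title, record.metadata["category"])
```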
15. Stop words
Stemming
Metadata
Explicit (tags)
Implicit (context)
Semantics
CMS and Database fields
XML tags and attributes
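Stop-word removal and stemming can be sketched as follows. The stop-word list is illustrative only, and `naive_stem` is a deliberately crude suffix stripper assumed for this example; production systems use a real stemmer and will give different results on many words.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in"}  # illustrative only

def naive_stem(word):
    """Strip a few common English suffixes; far cruder than a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Lowercase, split into words, drop stop words, and stem the rest."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

print(index_terms("Searching the boxes of ships"))  # ['search', 'box', 'ship']
```

The same processing has to run on query words, or stemmed index terms will never match.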
16. What happens after you click the search button and before retrieval starts
Usually in this order
Handle character set, maybe language
Look for operators and organize the query
Look for field names or metadata
Extract words (just like the indexer)
Deal with letter casing
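The query-processing steps above might look like this in code. It is a sketch under stated assumptions: the `field:value` syntax, the quoted-phrase operator, and the function name `parse_query` are illustrative choices, not any particular engine's query language.

```python
import re
import unicodedata

def parse_query(raw):
    """Split a raw query into phrases, field filters, and plain terms."""
    # 1. Character handling: normalize Unicode forms.
    text = unicodedata.normalize("NFKC", raw)
    # 2. Operators: pull out quoted phrases first.
    phrases = [p.lower() for p in re.findall(r'"([^"]+)"', text)]
    text = re.sub(r'"[^"]+"', " ", text)
    fields = {}
    terms = []
    for token in text.split():
        name, sep, value = token.partition(":")
        if sep and name and value:
            # 3. Field names: treat name:value as a metadata filter.
            fields[name.lower()] = value.lower()
        else:
            # 4 and 5. Extract words and fold case, as the indexer does.
            terms.extend(re.findall(r"[a-z0-9]+", token.lower()))
    return {"phrases": phrases, "fields": fields, "terms": terms}

parsed = parse_query('title:Alaska "shore excursions" Cruise')
print(parsed)
```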
17. • Retrieval: find files with query terms
• Not the same as relevance ranking
Recall: find all relevant items
Precision: find only relevant items
Increasing one usually decreases the other
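Both measures are simple ratios over the retrieved and relevant sets. The function name and the sample document IDs below are assumptions for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: what share of the retrieved items are relevant?
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: what share of the relevant items were retrieved?
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgments: the engine returned four items, three are relevant overall.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
print(precision, recall)
```

Returning every document would push recall to 1.0 while precision collapses, which is the trade-off in miniature.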
18. Single-word queries
Find items containing that word
Multi-word queries: combine lists
Any: every item with any query word
All: only items with every word
Phrases: find only items with all words in order
Boolean and complex queries
– Use algorithm to combine lists
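Combining per-word lists comes down to set operations, plus position checks for phrases. This sketch assumes a positional index (word, then document, then word positions); the function names and sample documents are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map word -> doc_id -> list of positions where the word occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def match_any(index, words):
    """Any: every item with at least one query word (union)."""
    return set().union(*(index[w].keys() for w in words if w in index))

def match_all(index, words):
    """All: only items with every query word (intersection)."""
    lists = [set(index[w]) if w in index else set() for w in words]
    return set.intersection(*lists) if lists else set()

def match_phrase(index, words):
    """Phrase: items where the words occur consecutively, in order."""
    result = set()
    for doc_id in match_all(index, words):
        starts = index[words[0]][doc_id]
        if any(all(start + i in index[w][doc_id] for i, w in enumerate(words))
               for start in starts):
            result.add(doc_id)
    return result

docs = {
    "d1": "river bank erosion",
    "d2": "bank of the river",
    "d3": "savings bank",
}
pos_index = build_positional_index(docs)
query = ["river", "bank"]
print(sorted(match_any(pos_index, query)))
print(sorted(match_all(pos_index, query)))
print(sorted(match_phrase(pos_index, query)))
```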
19. • Empty search
• Nothing on the site on that topic (scope)
• Misspelling or typing mistakes
• Vocabulary differences
• Restrictive search defaults
• Restrictive search choices
• Software failure
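For the misspelling case, one possible remedy is to suggest close matches from the index vocabulary when a query word is unknown. This sketch uses Python's standard `difflib.get_close_matches`; the vocabulary and the function name `suggest` are assumptions.

```python
import difflib

def suggest(word, vocabulary, limit=3):
    """Return up to `limit` indexed words that closely resemble `word`."""
    return difflib.get_close_matches(word.lower(), vocabulary, n=limit, cutoff=0.6)

# Hypothetical vocabulary drawn from the index.
vocabulary = ["cruise", "bank", "river"]
print(suggest("cruse", vocabulary))
```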
21. Theory: sort the matching items, so the most relevant ones appear first
Can't really know what the user wants
Relevance is hard to define and situational
Short queries tend to be deeply ambiguous
What do people mean when they type “bank”?
First 10 results are the most important
The more transparent, the better
22. Sorting documents on various criteria
Start with words matching query terms
Citation and link analysis
Like old library Citation Indexes
Ted Nelson - not only hypertext, but the links
Google PageRank
Incoming links
Authority of linkers
Taxonomies and external metadata
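The link-analysis idea (incoming links, weighted by the authority of the linking pages) can be illustrated with the simplified textbook form of PageRank. This is a sketch, not Google's production ranking: the damping value of 0.85, the iteration count, and the tiny link graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute rank scores for a dict of page -> outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly across its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical site: every page links back to "home".
links = {
    "home": ["products", "about"],
    "products": ["home"],
    "about": ["home"],
    "blog": ["home"],
}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
```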
23. • Term frequency in the item
• Inverse document frequency of term
Rare words are likely to be more important
w_ij = weight of term T_j in document D_i
tf_ij = frequency of term T_j in document D_i
N = number of documents in the collection
n = number of documents where term T_j occurs at least once
From Salton 1989
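The formula itself does not survive in this text. A common form that fits the definitions above is w_ij = tf_ij * log(N / n); the sketch below assumes that form with a natural logarithm (the log base is an assumption and only rescales the weights). The function name and sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / n)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)                      # N
    doc_freq = Counter()                         # n for each term
    for words in tokenized.values():
        doc_freq.update(set(words))
    weights = {}
    for doc_id, words in tokenized.items():
        term_freq = Counter(words)               # tf for each term in this doc
        weights[doc_id] = {
            term: tf * math.log(n_docs / doc_freq[term])
            for term, tf in term_freq.items()
        }
    return weights

docs = {
    "d1": "cruise cruise ship",
    "d2": "cruise bank",
    "d3": "river bank",
}
weights = tf_idf_weights(docs)
print(weights["d1"])
```

In "d1", "ship" occurs once but in only one document, so it outweighs "cruise", which occurs twice but is common across the collection.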
24. • Vector space
• Probabilistic (binary independence)
• Fuzzy set theory
• Bayesian statistical analysis
• Latent semantic indexing
• Neural networks
• Machine learning
• All require sophisticated queries
• See Modern Information Retrieval (MIR), chapter 2
25. Heuristics are rules of thumb
• Not algorithms, not math
Search Relevance Ranking Heuristics
• Documents containing all search words
• Search words as a phrase
• Matches in title tag
• Matches in other metadata
Based on real-world user behavior
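The heuristics above can be expressed as additive boosts. The weight values below are invented for illustration and would need tuning against real user behavior; the field names (`title`, `body`, `keywords`) and sample documents are also assumptions.

```python
# Assumed boost values; real deployments tune these.
WEIGHTS = {"all_words": 3.0, "phrase": 2.0, "title": 4.0, "metadata": 1.0}

def heuristic_score(query, doc):
    """Add a boost for each ranking heuristic the document satisfies."""
    words = query.lower().split()
    body = doc["body"].lower()
    body_words = set(body.split())
    title_words = set(doc["title"].lower().split())
    keyword_words = set(doc.get("keywords", "").lower().split())
    score = 0.0
    if words and all(w in body_words for w in words):
        score += WEIGHTS["all_words"]      # contains all search words
    if len(words) > 1 and " ".join(words) in body:
        score += WEIGHTS["phrase"]         # search words appear as a phrase
    if any(w in title_words for w in words):
        score += WEIGHTS["title"]          # match in the title
    if any(w in keyword_words for w in words):
        score += WEIGHTS["metadata"]       # match in other metadata
    return score

doc_a = {"title": "Alaska cruise deals",
         "body": "book an alaska cruise today",
         "keywords": "cruise travel"}
doc_b = {"title": "Company history",
         "body": "our first cruise to alaska sailed in 1990",
         "keywords": "corporate"}
print(heuristic_score("alaska cruise", doc_a), heuristic_score("alaska cruise", doc_b))
```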
26. What users see after they click the Search button
The most visible part of search
Elements of the results page
Page layout and navigation
Results header
List of results items
Results footer
29. Human judgment beats algorithms
Great for frequent, ambiguous searches
Use search log to identify best candidates
Recommend good starting pages
Product information, FAQs, etc.
Requires human resources
That means money and time
More static than algorithmic search
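At its simplest, this kind of human-curated recommendation is a lookup table consulted before the algorithmic results are shown. The table contents, URLs, and function name below are hypothetical.

```python
# Hypothetical, hand-maintained entries chosen from the search log.
BEST_BETS = {
    "returns": ["/help/returns-policy"],
    "contact": ["/about/contact-us"],
}

def results_with_best_bets(query, algorithmic_results):
    """Put curated pages first, then the remaining algorithmic results."""
    key = " ".join(query.lower().split())   # normalize case and spacing
    bets = BEST_BETS.get(key, [])
    rest = [r for r in algorithmic_results if r not in bets]
    return bets + rest

merged = results_with_best_bets(" Returns ", ["/blog/1", "/help/returns-policy"])
print(merged)
```

The table only helps while someone keeps it current, which is the staffing cost noted above.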
34. Leverage content structure
database fields (e.g. cruise amenities)
document metadata (e.g. news article bylines)
Provide both search and browse
Support information foraging
Integrate navigation with results
Not just subject taxonomies
Display only fruitful paths, no dead ends
Supported by academic research
Marti Hearst, UCB SIMS, flamenco.berkeley.edu
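Counting facet values over the current result set is what keeps dead ends out of the navigation: a value that matches nothing never appears. The field names and sample results below are assumptions for illustration.

```python
from collections import Counter

def facet_counts(results, facet_fields):
    """Count how many current results carry each value of each facet."""
    counts = {name: Counter() for name in facet_fields}
    for item in results:
        for name in facet_fields:
            value = item.get(name)
            if value is not None:
                counts[name][value] += 1
    return counts

# Hypothetical result set for a cruise search.
results = [
    {"title": "Glacier Bay", "region": "Alaska", "amenity": "spa"},
    {"title": "Inside Passage", "region": "Alaska", "amenity": "pool"},
    {"title": "Island Hopper", "region": "Caribbean", "amenity": "spa"},
]
counts = facet_counts(results, ["region", "amenity"])
print(counts["region"].most_common())
```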
37. Metrics
Number of searches
Number of no-matches searches
Traffic from search to high-value pages
Relate search changes to other metrics
Search Log Analysis
Top 5% searches: phrases and words
Top no-matches searches
Use as market research
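The basic metrics and top-query lists can be pulled from a log with a couple of counters. The log format assumed here, (query, result count) pairs, and the sample entries are hypothetical; real logs need parsing first.

```python
from collections import Counter

def summarize_log(entries, top_n=5):
    """Summarize (query, result_count) pairs from a search log."""
    all_queries = Counter()
    no_match = Counter()
    for query, result_count in entries:
        key = " ".join(query.lower().split())   # fold case and spacing
        all_queries[key] += 1
        if result_count == 0:
            no_match[key] += 1
    return {
        "searches": sum(all_queries.values()),
        "no_match_searches": sum(no_match.values()),
        "top_queries": all_queries.most_common(top_n),
        "top_no_match": no_match.most_common(top_n),
    }

log = [
    ("cruise", 12), ("Cruise", 12), ("cruse", 0),
    ("alaska cruise", 4), ("cruse", 0), ("cruise ", 12),
]
summary = summarize_log(log)
print(summary["top_queries"][0], summary["top_no_match"])
```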
38. Search engines can’t read minds
User queries are short and ambiguous
Some things will help
Design a usable interface
Show match words in context
Keep index current and complete
Adjust heuristic weighting
Maintain suggestions and synonyms
Consider faceted metadata search