A brief introduction to search engines and information retrieval. Covers basics about Google and Yahoo, fundamental terms of information retrieval, and an introduction to the well-known PageRank algorithm
3. Search Engines Overview
deep impact (not only for search)
developers face big challenges
search engines getting larger
problems not new
4. History
The web happened (1992)
Mosaic/Netscape happened (1993-95)
Crawler happened (1994): M. Mauldin
SEs happened 1994-1996
– InfoSeek, Lycos, Altavista, Excite, Inktomi, …
Yahoo decided to go with a directory
Google happened 1996-98
Tried selling technology to other engines
SEs thought search was a commodity, portals were in
Microsoft said: whatever …
5. Present
Most search engines have vanished
Google is a big player
Yahoo decided to de-emphasize directories
Buys three search engines
Microsoft realized Internet is here to stay
Dominates the browser market
Realizes search is critical
7. Google
first launched Sep. 1999
Over 4 billion pages by beginning of 2004
strengths
size and scope
relevance based
cached archive
weaknesses
limited search features
only indexes first 101KB of sites and PDFs
8. Yahoo!
founded by David Filo and Jerry Yang in 1995
originally just a subject directory
strengths
large, new (Feb. 2004) database
cached copies
support of full boolean searching
weaknesses
lack of some advanced search features
indexes only the first 500KB
tricky wildcard handling
9. MSN Search
used to use third-party databases
Feb. 2005 began using own db
strengths
large, unique database
cached copies including the date cached
weaknesses
limited advanced features
no title search, truncation, stemming
10. How Search Engines Work
Crawler-Based Search Engines
listing created automatically
Human-Powered Directories
contents filled by hand
"Hybrid Search Engines" Or Mixed Results
best of both worlds
11. Ranking Of Sites
location and frequency of keywords
keywords near top of page
spamming filter
"off the page" ranking
link structure
filtering fake links
clickthrough measurement
12. Search Engine Placement Tips (1)
pick your target keywords
position your keywords
have relevant content
avoid search engine stumbling blocks
have html links
frames can kill
dynamic doorblocks
13. Search Engine Placement Tips (2)
build links
just say no to search engine spamming
submit your key pages
verify & maintain your listing
beyond search engines
14. Features for webmasters
Feature          | Yes                                   | No                            | Notes
Deep Crawl       | AllTheWeb, Google, AltaVista, Teoma   | Inktomi                       |
Frames Support   | All                                   | n/a                           |
Robots.txt       | All                                   | n/a                           |
Meta Robots Tag  | All                                   | n/a                           |
Paid Inclusion   | All but…                              | Google                        |
Full Body Text   | All                                   | n/a                           | Some stop words may not be indexed
Stop Words       | AltaVista, Inktomi, FAST              | Google                        | Teoma unknown
Meta Description | All                                   |                               | All provide some support, but AltaVista, AllTheWeb and Teoma make most use of the tag
Meta Keywords    | Inktomi, Teoma                        | AllTheWeb, AltaVista, Google  | Teoma support is "unofficial"
ALT text         | AltaVista, Google, AllTheWeb, Inktomi | Teoma                         |
Comments         | Inktomi                               | Others                        |
15. What is Information Retrieval?
Information gets lost in the sheer amount of
documents, but has to be found again
Definition:
IR is the field that deals with retrieving
information/knowledge from large document
databases.
16. Quality of an IR-System (1)
Precision:
Is the ratio of the relevant documents retrieved
to the total number of documents retrieved.
Precision ∈ [0;1]
Precision = 1: all retrieved documents are
relevant
17. Quality of an IR-System (2)
Recall:
Is the ratio of the number of relevant
documents retrieved to the total number of
relevant documents (retrieved and not).
Recall ∈ [0;1]
Recall = 1: all relevant documents were found
18. Quality of an IR-System (3)
Aim of a good IR-System:
increase both Precision and Recall!
Problem:
increasing Precision causes a decrease in Recall
e.g.: the search returns 1 document:
Recall -> 0, Precision = 1
increasing Recall causes a decrease in Precision
e.g.: the search returns all available documents:
Recall = 1, Precision -> 0
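A tiny, hypothetical example (the document IDs are invented) illustrating how the two measures are computed for a single query:

```python
# Hypothetical example: precision and recall for one query.
retrieved = {"d1", "d2", "d3", "d4"}        # documents the system returned
relevant  = {"d2", "d4", "d5", "d6", "d7"}  # documents that are actually relevant

hits = retrieved & relevant                  # relevant documents that were retrieved

precision = len(hits) / len(retrieved)       # 2 / 4 = 0.5
recall    = len(hits) / len(relevant)        # 2 / 5 = 0.4
```

Returning only the single safest document would push precision toward 1 while recall collapses; returning everything does the opposite.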
20. Boolean model
checks whether the document contains the search
term (true) or not (false); true means the
document is relevant
Problem:
high variation in result size, depending on
the search term
no ranking of the result set -> no sorting possible
the "relevance" criterion is too strict (just AND, OR)
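A minimal sketch of the boolean model on an invented three-document corpus; note that results are unranked and the result size varies with the operator:

```python
# Boolean model sketch: a document is "relevant" iff it contains
# the query terms combined with AND / OR. Corpus is invented.
docs = {
    "d1": "information retrieval deals with large document collections",
    "d2": "search engines crawl the web",
    "d3": "boolean retrieval checks term presence",
}

def terms(text):
    return set(text.split())

def boolean_and(query_terms, text):
    return terms(text) >= set(query_terms)       # every term must occur

def boolean_or(query_terms, text):
    return bool(terms(text) & set(query_terms))  # any term suffices

and_result = [d for d, t in docs.items() if boolean_and(["retrieval", "document"], t)]
or_result  = [d for d, t in docs.items() if boolean_or(["retrieval", "document"], t)]
# AND returns one document, OR returns two -- but neither list is ordered
# by relevance, which is exactly the model's weakness.
```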
21. Vector space model (1)
weighted index vector
dj = (w1,j, w2,j, w3,j, …, wn,j)
weighted search vector
q = (w1,q, w2,q, w3,q, …, wn,q)
analyze the angle between search vector and
document vector by using the cosine function
the smaller the angle, the more relevant the
document -> use it for ranking
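The cosine ranking above can be sketched as follows; the term-weight vectors are invented for illustration:

```python
import math

def cosine(d, q):
    # cosine of the angle between two term-weight vectors
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

# toy weight vectors over a 3-term vocabulary (weights invented)
d1 = [0.5, 0.8, 0.0]
d2 = [0.9, 0.1, 0.3]
q  = [0.4, 0.8, 0.1]

sims = {"d1": cosine(d1, q), "d2": cosine(d2, q)}
ranking = sorted(sims, key=sims.get, reverse=True)  # most similar first
```

Unlike the boolean model, every document gets a score in [0;1], so the result set can be sorted.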
22. Vector space model (2)
"relevance" criterion is more tolerant
no use of boolean operators
uses weighting
creates a ranking -> sort is possible
Problem:
automatic weighting of index terms in queries
and documents
23. Weighting Methods (1)
Zipf's law
global weighting (IDF “inverse document
frequency”)
considers the distribution of words in a
language
filters out very frequent words like "or", "and"
and weights them weakly
IDF = log( N / n)
N = Number of documents in the system
n = number of documents including the index term
24. Weighting Methods (2)
local weighting
considers the term frequency within documents
weighting corresponds to the frequency
accounts for different document lengths by
normalizing the term frequency
ntfi,j = tfi,j / max l ∈ {1…n} tfl,j
tfi,j = absolute frequency of term ti in document dj
25. Weighting Methods (3)
tf-idf weighting
combination of global (inverse document
frequency) and local (normalized term
frequency) weighting
wi,j = ntfi,j ∗ idfi
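The combined tf-idf weighting can be sketched like this; the three-document corpus is invented for illustration:

```python
import math

# tf-idf sketch: normalized term frequency (local) times
# inverse document frequency (global). Corpus is invented.
docs = [
    "web search engines rank web pages",
    "information retrieval and web mining",
    "page rank weights pages by links",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)                       # number of documents in the system

def ntf(term, doc_tokens):
    # term frequency normalized by the most frequent term in the document
    counts = {t: doc_tokens.count(t) for t in doc_tokens}
    return counts.get(term, 0) / max(counts.values())

def idf(term):
    # n = number of documents containing the term
    n = sum(1 for doc in tokenized if term in doc)
    return math.log(N / n) if n else 0.0

def tf_idf(term, doc_tokens):
    return ntf(term, doc_tokens) * idf(term)

# "web" occurs in two of three documents -> low idf;
# "retrieval" occurs in only one -> higher idf.
```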
26. Web-Mining
Web-Mining ≈ Data-Mining, but with different problems
Mining of: content, structure, or users
Content-Mining: VSM, BM
Structure-Mining: analysis of the link structure
User-Mining: information about the users of a page
Let's have a deeper look at Web-Structure-Mining!
27. History
IR necessary but not sufficient for web search
Doesn’t address web navigation
Query ibm seeks www.ibm.com
To IR www.ibm.com may look less topical than a
quarterly report
Link analysis
Hubs and authority (Jon Kleinberg)
PageRank (Brin and Page)
Computed on the entire graph
Query independent
Faster if serving lots of queries
Others…
28. Analysis of Hyperlinks
Links
Long history in citation analysis
Navigational tools on the web
Also a sign of popularity
Can be thought of as recommendations
(source recommends destination)
Also describe the destination: anchor text
Idea: the existence of a hyperlink between two
pages itself carries information
Hyperlinks can be used to:
Create a weighting of web pages
Find pages with similar topics
Group pages by different context of meaning
29. Hubs and Authorities
Describe the quality of a
website
Authorities: pages that
are linked to very often
Hubs: pages that link to
many other pages
Example:
Authority: Heise.de
Hub: Peter‘s Linklist
30. Page Rank
Invented by Lawrence Page and Sergey Brin
Algorithm itself is well-described
Implementations are not (Google)
Main Idea:
based on the link structure of the whole WWW
The more often a document is linked, the more important it is
Not every link counts the same – a link from an
important page is worth more
31. Page Rank Algorithm
PR(p0) = (1 - q) + q * Σ PR(pi) / outlinks(pi)
PR(p0): PageRank of page p0
PR(pi): PageRank of the pages pi linking to p0
outlinks(pi): number of outgoing links of pi
q: damping factor of the random-walk model (normally q = 0.85)
Attention: recursive function – computed iteratively!
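Assuming the standard formulation PR(p) = (1 - q) + q * Σ PR(pi)/outlinks(pi), the recursion is resolved by iterating until the values stabilize. A sketch on an invented three-page link graph:

```python
# Iterative PageRank sketch on a made-up link graph.
links = {            # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
q = 0.85                                     # damping factor
pr = {page: 1.0 for page in links}           # initial guess

for _ in range(50):                          # iterate until (approximately) stable
    new_pr = {}
    for page in links:
        incoming = (pr[src] / len(out)       # each inlink contributes its
                    for src, out in links.items()  # rank divided by its outdegree
                    if page in out)
        new_pr[page] = (1 - q) + q * sum(incoming)
    pr = new_pr
# C gathers the most link weight (linked by both A and B),
# so it ends up with the highest PageRank.
```

With this formulation the ranks sum to the number of pages once the iteration converges.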
36. Page Rank other Examples
Dangling Links
Different
hierarchies
37. Page Rank Implementation
Normally implemented as weighting system
An additional content search is needed to
retrieve the document set
Also involved in Page Rank
The markup of a link
The position of a link in the document
The distance between the pages (e.g. a
different domain)
The context of the linking page
The recency of the page
43. Google by Numbers
Index: 40 TB (4 billion pages at an est. 10 KB each)
Up to 2000 Servers in one Cluster
Over 30 clusters
One petabyte of data per cluster – so much that a
hard-disk error rate of 1 in 10^15 bits
becomes a real problem
On a normal day, about two servers break down
in each larger cluster
The system as a whole has been running stably
(without a total outage) since February 2000 (yes,
they don't use Windows servers…)
44. Look-out: Semantic Web
Information should be readable by humans &
machines
Unified description of data & knowledge
First approaches: meta-data, e.g. Dublin
Core
Current approach: RDF
45. Look-out: Personalized Search Engine
A new approach: personalized Search
Engines
Advantage: you only get results you are
personally interested in
Disadvantage: A lot of data has to be
collected
Example:
www.fooxx.com
46. Links
www.searchenginewatch.com (general
information about search engines)
http://pr.efactory.de (page rank algorithm)
http://zdnet.de/itmanager/unternehmen/0,3902344
(article: “Google’s Technologien: Von
Zauberei kaum zu unterscheiden”)