1
Web Search
Introduction
2
The World Wide Web
• Developed by Tim Berners-Lee in 1990 at
CERN to organize research documents
available on the Internet.
• Combined idea of documents available by
FTP with the idea of hypertext to link
documents.
• Developed initial HTTP network protocol,
URLs, HTML, and first “web server.”
3
Web Pre-History
• Ted Nelson developed idea of hypertext in 1965.
• Doug Engelbart invented the mouse and built the
first implementation of hypertext in the late 1960’s
at SRI.
• ARPANET went online in 1969 and grew through the early 1970’s.
• The basic technology was in place in the 1970’s;
but it took the PC revolution and widespread
networking to inspire the web and make it
practical.
4
Web Browser History
• Early browsers were developed in 1992 (Erwise,
ViolaWWW).
• In 1993, Marc Andreessen and Eric Bina at UIUC
NCSA developed the Mosaic browser and
distributed it widely.
• Andreessen joined with James Clark (Stanford
Prof. and Silicon Graphics founder) to form
Mosaic Communications Inc. in 1994 (which
became Netscape to avoid conflict with UIUC).
• Microsoft licensed Mosaic code from Spyglass, UIUC’s commercial
licensee, and used it to build Internet Explorer in 1995.
5
Search Engine Early History
• By the late 1980’s many files were available by
anonymous FTP.
• In 1990, Alan Emtage of McGill Univ.
developed Archie (short for “archives”)
– Assembled lists of files available on many FTP
servers.
– Allowed regex search of these file names.
• In 1993, Veronica and Jughead were
developed to search names of text files
available through Gopher servers.
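As a rough illustration of Archie-style lookup, the sketch below runs a regular expression over a list of file names. The server names, paths, and the `search_listing` helper are invented for the example; this is not Archie's actual implementation.

```python
import re

# Hypothetical listing: (server, path) pairs such as an Archie-style index might hold.
LISTING = [
    ("ftp.example.edu", "/pub/gnu/emacs-18.59.tar.Z"),
    ("ftp.example.org", "/pub/tex/latex.tar.Z"),
    ("ftp.example.net", "/mirrors/gnu/gcc-2.3.3.tar.Z"),
]


def search_listing(pattern, listing):
    """Return the (server, path) pairs whose file name matches the regex."""
    regex = re.compile(pattern)
    hits = []
    for server, path in listing:
        filename = path.rsplit("/", 1)[-1]  # match on the file name only
        if regex.search(filename):
            hits.append((server, path))
    return hits


if __name__ == "__main__":
    for server, path in search_listing(r"^(emacs|gcc)-.*\.tar\.Z$", LISTING):
        print(server, path)
```

Only names are searched, not file contents, which matches the slide's description of these early tools.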
6
Web Search History
• In 1993, early web robots (spiders) were
built to collect URLs:
– Wanderer
– ALIWEB (Archie-Like Index of the WEB)
– WWW Worm (indexed URLs and titles for
regex search)
• In 1994, Stanford grad students David Filo
and Jerry Yang started manually collecting
popular web sites into a topical hierarchy
called Yahoo.
7
Web Search History (cont.)
• In early 1994, Brian Pinkerton developed
WebCrawler as a class project at U Wash.
(eventually became part of Excite and AOL).
• A few months later, Michael “Fuzzy” Mauldin, a researcher
at CMU, developed Lycos. First to use a standard
IR system as developed for the DARPA Tipster
project. First to index a large set of pages.
• In late 1995, DEC developed AltaVista. Used a
large farm of Alpha machines to quickly process
large numbers of queries. Supported boolean
operators, phrases, and “reverse pointer” queries.
8
Web Search History (cont.)
• In 1998, Larry Page and Sergey Brin, Ph.D.
students at Stanford, started Google. Main
advance is use of link analysis to rank
results partially based on authority.
• Microsoft launched MSN Search in 1998
based on Inktomi (started from UC
Berkeley in 1996), changed to Live Search
in 2007, and Bing in 2009.
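The slides do not spell out how link analysis assigns authority. The sketch below is a textbook power-iteration PageRank on a toy graph, offered only as one way to picture the idea. The damping factor of 0.85, the iteration count, and the four-page graph are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}.

    Every link target must also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical toy web: C is linked to by A, B, and D.
TOY_WEB = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

if __name__ == "__main__":
    for page, score in sorted(pagerank(TOY_WEB).items()):
        print(page, round(score, 3))
```

In this toy graph the page with the most incoming links ends up with the highest score, and a page nobody links to keeps only the baseline share.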
9
Web Challenges for IR
• Distributed Data: Documents spread over millions of
different web servers.
• Volatile Data: Many documents change or disappear
rapidly (e.g. dead links).
• Large Volume: Billions of separate documents.
• Unstructured and Redundant Data: No uniform
structure, HTML errors, up to 30% (near) duplicate
documents.
• Quality of Data: No editorial control, false
information, poor quality writing, typos, etc.
• Heterogeneous Data: Multiple media types (images,
video, VRML), languages, character sets, etc.
10
Growth of Web Pages Indexed
[Figure: billions of pages indexed by Google, Inktomi, AllTheWeb, Teoma, and AltaVista. Source: SearchEngineWatch.]
Assuming 20KB per page,
1 billion pages is about 20 terabytes of data.
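The storage estimate on this slide is simple arithmetic, sketched below. The helper name `corpus_size_terabytes` is made up, and decimal units (1 TB = 10^9 KB) are assumed.

```python
def corpus_size_terabytes(num_pages, kb_per_page=20):
    """Raw size of a crawl in terabytes, using decimal units (1 TB = 10**9 KB)."""
    return num_pages * kb_per_page / 10**9


if __name__ == "__main__":
    # 1 billion pages at 20 KB each.
    print(corpus_size_terabytes(10**9))
```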
Current Size of the Web
11
12
Zipf’s Law on the Web
• Number of in-links/out-links to/from a page
has a Zipfian distribution.
• Length of web pages has a Zipfian
distribution.
• Number of hits to a web page has a Zipfian
distribution.
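A Zipfian distribution means the item of rank r has a frequency roughly proportional to 1/r^s, with s close to 1. The sketch below builds ideal Zipfian counts and shows that rank times count stays constant when s = 1. The function name and the sample numbers are assumptions for illustration, not measurements from the web.

```python
def zipf_counts(num_items, top_count, exponent=1.0):
    """Ideal Zipfian counts: the item of rank r gets top_count / r**exponent."""
    return [top_count / rank**exponent for rank in range(1, num_items + 1)]


if __name__ == "__main__":
    hits = zipf_counts(num_items=5, top_count=1000)
    for rank, count in enumerate(hits, start=1):
        # With exponent 1, rank * count stays constant.
        print(rank, round(count, 1), round(rank * count, 1))
```

Real link counts, page lengths, and hit counts only approximate this curve, so the exponent is normally fitted to the data.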
13
Zipf’s Law and Web Page Popularity
14
“Small World” (Scale-Free) Graphs
• Social networks and six degrees of separation.
– Stanley Milgram Experiment
• Power law distribution of in and out degrees.
• Distinct from purely random graphs.
• “Rich get richer” generation of graphs
(preferential attachment).
• Kevin Bacon game.
– Oracle of Bacon
• Erdős number.
• Networks in biochemistry, roads,
telecommunications, Internet, etc. are “small world” networks.
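The “rich get richer” process can be sketched in a few lines: each new node links to an existing node chosen with probability proportional to that node's current degree. This is a minimal version with one link per new node and a fixed random seed, both assumptions made for the example.

```python
import random
from collections import Counter


def preferential_attachment(num_nodes, seed=0):
    """Grow a graph one node at a time; each new node links to one existing
    node chosen with probability proportional to its current degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]      # start from a single edge
    endpoints = [0, 1]    # each node appears once per incident edge
    for new_node in range(2, num_nodes):
        # Picking uniformly from the endpoint list is a degree-proportional choice.
        target = rng.choice(endpoints)
        edges.append((new_node, target))
        endpoints.extend([new_node, target])
    return edges


def degree_counts(edges):
    """Map node -> degree."""
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return degrees


if __name__ == "__main__":
    degrees = degree_counts(preferential_attachment(1000))
    print("highest degrees:", degrees.most_common(5))
```

Graphs grown this way typically have many nodes with a single link and a few heavily linked hubs, which is the power-law pattern the slide describes.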
15
Manual Hierarchical Web Taxonomies
• Yahoo approach of using human editors to
assemble a large hierarchically structured
directory of web pages (closed in 2014).
• Open Directory Project is a similar
approach based on the distributed labor of
volunteer editors (“net-citizens provide the
collective brain”). Used by most other
search engines. Acquired by Netscape in 1998, soon after launch.
– http://www.dmoz.org/
16
Business Models for Web Search
• Advertisers pay for banner ads on the site that do not
depend on a user’s query.
– CPM: Cost Per Mille (thousand impressions). Pay for each ad
display.
– CPC: Cost Per Click. Pay only when user clicks on ad.
– CTR: Click Through Rate. Fraction of ad impressions that result in
click-throughs. CPC = CPM / (CTR * 1000)
– CPA: Cost Per Action (Acquisition). Pay only when user actually
makes a purchase on target site.
• Advertisers bid for “keywords”. Ads for highest bidders
displayed when user query contains a purchased keyword.
– PPC: Pay Per Click. CPC for bid word ads (e.g. Google
AdWords).
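The relation CPC = CPM / (CTR * 1000) follows from dividing the cost of one impression (CPM / 1000) by the clicks per impression (CTR). A small worked example, with a made-up price and click-through rate:

```python
def cost_per_click(cpm, ctr):
    """Effective CPC implied by a CPM price and a click-through rate.

    cpm: price for 1000 impressions; ctr: clicks / impressions (0 < ctr <= 1).
    """
    if not 0 < ctr <= 1:
        raise ValueError("ctr must be in (0, 1]")
    return cpm / (ctr * 1000)


if __name__ == "__main__":
    # Hypothetical numbers: a $5.00 CPM at a 2% click-through rate.
    print(round(cost_per_click(cpm=5.00, ctr=0.02), 2))
```

With those numbers, 1000 impressions cost $5.00 and yield about 20 clicks, so each click costs about $0.25.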
17
History of Business Models
• Initially, banner ads paid through CPM were the norm.
• GoTo Inc. forms in 1997; it originates and patents the
bidding and PPC business model.
• Google introduces AdWords in fall 2000.
• GoTo renamed Overture in Oct. 2001.
• Overture sues Google for use of PPC in Apr. 2002.
• Overture acquired by Yahoo in Oct. 2003.
• Google settles with Overture/Yahoo for 2.7 million
shares of Class A common stock in Aug. 2004.
18
Affiliate Programs
• If you have a website, you can earn
income as an affiliate by
agreeing to post ads relevant to the topic of
your site.
• If users click on your impression of an ad,
you get some percentage of the CPC or PPC
income that is generated.
• Google introduces AdSense affiliates
program in 2003.
19
Automatic Document Classification
• Manual classification into a given hierarchy is
labor intensive, subjective, and error-prone.
• Text categorization methods provide a way to
automatically classify documents.
• Best methods based on training a machine
learning (pattern recognition) system on a labeled
set of examples (supervised learning).
• Text categorization is a topic we will discuss later
in the course.
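The slide does not name a specific learning method. As one common supervised approach, the sketch below uses multinomial Naive Bayes with add-one smoothing. The training sentences and the two labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled examples.
TRAINING = [
    ("cheap flights and hotel deals", "travel"),
    ("book a hotel near the beach", "travel"),
    ("python code for sorting a list", "programming"),
    ("debugging code with a python debugger", "programming"),
]


def train(labeled_docs):
    """Collect per-label word counts, per-label document counts, and the vocabulary."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of documents
    vocab = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log posterior."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            # Add-one smoothing keeps unseen (word, label) pairs from zeroing the score.
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train(TRAINING)
    print(classify("hotel deals near the beach", model))
    print(classify("python sorting code", model))
```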
20
Automatic Document Hierarchies
• Manual hierarchy development is labor intensive,
subjective, and error-prone.
• It would be nice to automatically construct a
meaningful hierarchical taxonomy from a corpus
of documents.
• This is possible with hierarchical text clustering
(unsupervised learning).
– Hierarchical Agglomerative Clustering (HAC)
• Text clustering is another topic we will discuss
later in the course.
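A minimal sketch of HAC, assuming single-link merging and Jaccard similarity over word sets. Both choices, the four sample documents, and the function names are assumptions; the slides do not specify a similarity measure or linkage.

```python
def jaccard(a, b):
    """Jaccard similarity between two non-empty word sets."""
    return len(a & b) / len(a | b)


def hac(docs):
    """Single-link HAC over non-empty documents.

    Returns the merge history as a list of (cluster_a, cluster_b) pairs,
    where each cluster is a tuple of document indices, in merge order.
    """
    word_sets = [set(d.lower().split()) for d in docs]
    clusters = [(i,) for i in range(len(docs))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(jaccard(word_sets[x], word_sets[y])
                          for x in clusters[i] for y in clusters[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical documents: two about search, two about advertising.
DOCS = [
    "web search engine ranking",
    "web search engine index",
    "banner ads pricing",
    "keyword ads pricing bids",
]

if __name__ == "__main__":
    for left, right in hac(DOCS):
        print(left, "+", right)
```

Read bottom-up, the merge history is the hierarchy: the two search documents join first, then the two advertising documents, then the two groups.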
21
Web Search Using IR
[Diagram: a web spider crawls the Web to build a document corpus. The IR system takes a query string and the corpus and returns ranked documents: 1. Page1, 2. Page2, 3. Page3, ...]
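The pipeline on this slide (spider, document corpus, IR system, ranked documents) can be sketched with a tiny inverted index. The corpus below stands in for pages a spider has already fetched, and scoring by raw term counts is an assumption made to keep the example short; real systems use richer weighting.

```python
from collections import Counter, defaultdict

# Hypothetical corpus standing in for pages a spider has already fetched.
CORPUS = {
    "Page1": "web search engines rank web pages",
    "Page2": "a spider crawls the web",
    "Page3": "cooking recipes for pasta",
}


def build_index(corpus):
    """Inverted index: term -> {page: term count}."""
    index = defaultdict(dict)
    for page, text in corpus.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][page] = count
    return index


def search(query, index):
    """Rank pages by the summed counts of the query terms they contain."""
    scores = Counter()
    for term in query.lower().split():
        for page, count in index.get(term, {}).items():
            scores[page] += count
    # Highest score first; ties broken by page name for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [page for page, _ in ranked]


if __name__ == "__main__":
    index = build_index(CORPUS)
    for position, page in enumerate(search("web search", index), start=1):
        print(f"{position}. {page}")
```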
