Web search engines

  Alexander Tolmachev
       gr. #3057/2
Contents

   Introduction: what do web search engines
    mean for us today?
   History of web search engines
   How web search engines work
   Most popular search engines
   Conclusion: past, present and future of web
    search


The Web as a huge store of information
   A huge amount of information is contained in
    the World Wide Web
   This amount is still growing day by day
   We need to orient ourselves in this enormous
    information space
   Web search engines let us quickly find the
    information we are interested in
Web search engines in our life
   We use web search engines every day for:
       Searching for texts, articles, books, news, etc.
       Searching for media: music, videos, films,
        pictures, etc.
       Searching for goods
       Searching for web sites and web portals
       Preparing lectures and presentations ☺
       …
   The verb “to google” is included in dictionaries
   Web search engines have become an integral
    part of our life
The very first search tools

   1989–1991 – the invention of the World Wide
    Web by Sir Tim Berners-Lee at CERN
   Archie (1990)
       The first Internet search tool
       Fetched and indexed files on FTP servers
       Provided search over the indexed files
   Veronica and Jughead – search tools similar to
    Archie, built for the Gopher protocol (Gopher
    was introduced in 1991)
The first web search engines

   W3Catalog (1993)
       The first primitive search engine
       Mirroring and integration of manually maintained
        catalogues
       Still available: http://www.w3catalog.com/
   World Wide Web Wanderer (1993)
       The first web crawler
       The first web index called Wandex
       Aimed to measure the size of the Web, not to
        serve as a search tool
The first web search engines
   JumpStation (1993)
       The first web search engine combining crawling,
        indexing and searching
       A web form for search queries
       No ranking, just listing search results
   Excite (1994)
       The first ranking system
   WebCrawler (1994)
       Indexing full text
       The first widely known web search engine
Web search evolution

   1994–1997 – a number of similar web search
    engines appeared:
       Infoseek
       OpenText
       Magellan
       Inktomi
       Northern Light
       AskJeeves
       AltaVista
Web search evolution

   Yahoo! (1994)
       Search in a human-edited hierarchical web directory
       Relevance decided manually by editors
       Search by keywords as well as browsing the full
        directory
       Gained great popularity
       In 2004, developed its own web search engine
       One of the main stars of the business world in the 1990s
Web search evolution

   Google (1998)
       The invention of PageRank
       A simple, clean interface instead of turning into
        a web portal
   Yandex (1997)
       Full-text search with Russian morphology support
       Quickly gained great popularity in Russia
Web search engines today
   Powerful web search technologies
       Very fresh results
       A variety of searchable document types
       Intelligent ranking algorithms
   Media search:
       Images
       Music
       Videos
       …
Web search engines today

   Personalized search
       Based on the user's search history
       Based on personal information from social
        networks
   Location-based search
   Vertical search
   Image-based search
   Audio-based search
Basic principles of web search

   Create and sort a pool of data
   Find the most appropriate information
   Deliver this information




Basic parts of a web search engine
   A web spider/crawler/robot – a computer
    program which:
       Continuously traverses web pages
       Finds new or changed content
       Stores visited pages in the corpus
   Index – a database containing the crawling results
   Search engine – a computer program which:
       Identifies pages relevant to the search query
       Retrieves these pages
       Ranks them
   User interface
Web crawling
   Web crawling traverses web pages and stores
    their copies for further indexing
   General web crawler algorithm:
       Starts with a list of initial URLs, called
        the seeds
       Visits these URLs
       Retrieves the required information from each page
       Identifies all the hyperlinks on the page
       Adds these links to the queue of URLs, called the
        crawl frontier
       Repeatedly visits URLs taken from the crawl frontier
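
The general algorithm above can be sketched in a few lines of Python. This is only a minimal illustration, not a production crawler: instead of fetching pages over HTTP it walks a small hypothetical in-memory link graph (`TOY_WEB`), and the function name `crawl`, the `fetch_links` callback and the `max_pages` limit are assumptions made for the example.

```python
from collections import deque

# Hypothetical toy "web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://d.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}


def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs."""
    frontier = deque(seeds)  # the crawl frontier: URLs waiting to be visited
    seen = set(seeds)        # avoids queueing the same URL twice
    visited = []             # URLs in the order they were visited
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)            # stands in for storing the page copy
        for link in fetch_links(url):  # hyperlinks identified on the page
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["http://a.example/"], lambda url: TOY_WEB.get(url, []))
print(order)
```

A queue is used instead of literal recursion, so the frontier can grow large without exhausting the call stack; the `seen` set keeps pages that link to each other from being visited forever.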
Web crawler architecture

   [Figure: web crawler architecture diagram]
Crawling policies
   A selection policy
       Focused crawling
       Restricting followed links
       URL normalization
       Path-ascending crawling
   A re-visit policy
       Uniform policy
       Proportional policy
   A politeness policy
   A parallelization policy
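
As one illustration of the selection policy, URL normalization maps different spellings of the same address to a single form, so that a page is not crawled twice. The sketch below is a simplified assumption of what such a step might do (lower-casing the scheme and host, dropping the default port and the fragment, using "/" for an empty path); the function name `normalize_url` is hypothetical, and real crawlers apply more rules than these.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the meaning of a URL.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return a simplified canonical form of an http(s) URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # hostname is already lower-cased
    netloc = host                          # note: user:password@ is dropped
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    path = parts.path or "/"               # treat an empty path as "/"
    # The fragment (#...) is never sent to the server, so it is removed.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


print(normalize_url("HTTP://Example.COM:80/a/b?x=1#top"))
print(normalize_url("http://example.com"))
```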
Indexing

   Indexing makes it fast and efficient to find the
    documents in the corpus that are relevant to a
    search query
   For example, for 10,000 documents:
       With an index, a query is answered within milliseconds
       A sequential scan could take hours
   Meta search engines reuse the indices of other
    services and do not store a local index
       E.g. vertical search can use the indices of vertical
        services
Inverted index
   For each word, stores a list of the documents
    containing that word
   Provides direct access to the documents
    associated with each word in the search query
   Commonly used by web search engines
   Not convenient to update
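
A minimal sketch of an inverted index, assuming a hypothetical three-document corpus (`DOCS`) and plain whitespace tokenization; the function names `build_inverted_index` and `search` are illustrative and not taken from any particular search engine.

```python
from collections import defaultdict

# Hypothetical three-document corpus.
DOCS = {
    1: "web search engines index the web",
    2: "a crawler traverses web pages",
    3: "search engines rank pages",
}


def build_inverted_index(docs):
    """Map each word to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of the documents containing every query word."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)


index = build_inverted_index(DOCS)
print(search(index, "search engines"))  # [1, 3]
print(search(index, "web pages"))       # [2]
```

Each query word is looked up directly, and only the matching document lists are intersected; no document text is scanned at query time.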
Forward index
   Stores a list of words for each document
   It is more convenient to store the words of a
    document right away, while it is being parsed
   Enables asynchronous processing – much easier
    to update than an inverted index
   Is stored in order to be transformed into an inverted index
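
The relationship between the two indices can be sketched as follows. The data and the function names `add_document` and `invert` are hypothetical; the point is only that adding a document touches a single entry of the forward index, while the inverted index is derived from it in a separate pass.

```python
# Hypothetical forward index: document id -> words found while parsing it.
forward_index = {
    "doc1": ["web", "search", "engine"],
    "doc2": ["web", "crawler"],
}


def add_document(forward, doc_id, text):
    """Updating the forward index touches only one entry."""
    forward[doc_id] = text.lower().split()


def invert(forward):
    """Turn a forward index into an inverted index (word -> sorted doc ids)."""
    inverted = {}
    for doc_id, words in forward.items():
        for word in set(words):  # count each word once per document
            inverted.setdefault(word, []).append(doc_id)
    return {word: sorted(ids) for word, ids in inverted.items()}


add_document(forward_index, "doc3", "search index")
inverted_index = invert(forward_index)
print(inverted_index["web"])     # ['doc1', 'doc2']
print(inverted_index["search"])  # ['doc1', 'doc3']
```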
Ranking

   Ranking is the arrangement of web search
    results in order of relevance
   Usually based on statistical methods
       Frequency of keywords in a particular document
       Rating of page popularity and authority

   Advanced search engines also use intelligent
    ranking algorithms
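
As a rough illustration of the keyword-frequency idea only, the sketch below scores each document by the share of its words that are query keywords and sorts by that score. The documents, the scoring function `keyword_score` and the function `rank` are assumptions made for this example; real engines combine many more signals, including page authority.

```python
from collections import Counter

# Hypothetical documents to be ranked for a query.
DOCS = {
    "page1": "search engines search the web",
    "page2": "the web is large",
    "page3": "engines power cars and search tools",
}


def keyword_score(text, query_words):
    """Share of the words in `text` that are query keywords."""
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    return sum(counts[w] for w in query_words) / len(words)


def rank(docs, query):
    """Return document ids ordered from most to least relevant."""
    query_words = set(query.lower().split())
    scores = {doc_id: keyword_score(text, query_words)
              for doc_id, text in docs.items()}
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


print(rank(DOCS, "search engines"))  # ['page1', 'page3', 'page2']
```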
Google PageRank
   PageRank was invented in 1998 by Larry Page
    and Sergey Brin at Stanford University
   It rates the authority of a web page relative
    to other web pages
   Basic principles:
       A hyperlink to a page counts as a vote of support
       A page with many incoming links has high
        authority
       A hyperlink coming from an authoritative web page
        carries more weight
   PR(p) is the probability that a person randomly
    clicking on links will arrive at page p
Google PageRank

   [Figure: a link graph of four pages A, B, C and D, shown three times with the PageRank values below]

       A      B      C      D
      0.25   0.25   0.25   0.25
      1/2    1/6    1/6    1/6
      6/17   2/17   3/17   6/17
Google PageRank

   So, the PageRank of page A:

       [formula missing from the extracted slide text]

   In the general case, the PageRank value for
    any page u:

       PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}

    where B_u is the set of all pages linking to
    page u, and L(v) is the number of links from page v.
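Google PageRank
   Spider traps:

       [Figure: three pages A, B and C illustrating a spider trap]

   Damping factor
       d – the probability that the random surfer continues following links
       (1-d) – the probability of jumping to a random page
   The resulting formula:

       [formula missing from the extracted slide text]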
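
The slide's own formula is not available in this text, so the sketch below uses a commonly cited damped form as an assumption: PR(u) = (1 - d)/N + d * (sum of PR(v)/L(v) over all pages v linking to u), where N is the total number of pages. The graph `LINKS`, the value d = 0.85, the fixed iteration count and the handling of pages without outgoing links are all choices made for this example, not details taken from the slides.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PageRank for a dict {page: [pages it links to]}.

    Every link target must also appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new_pr = {p: (1.0 - d) / n for p in pages}  # random-jump share
        for v, outgoing in links.items():
            if outgoing:
                share = d * pr[v] / len(outgoing)   # d * PR(v) / L(v)
                for u in outgoing:
                    new_pr[u] += share
            else:
                # Assumption: a page without links spreads its rank evenly.
                for u in pages:
                    new_pr[u] += d * pr[v] / n
        pr = new_pr
    return pr


# Hypothetical four-page graph (not the one shown on the slides).
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

ranks = pagerank(LINKS)
for page, value in sorted(ranks.items()):
    print(page, round(value, 3))
```

Because of the (1 - d)/N term, every page keeps a small share of rank even when nothing links to it, and rank cannot accumulate indefinitely inside a spider trap.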
Web Search Engine Architecture

   [Figure: web search engine architecture diagram]
Google
   Started in 1996 as a research project of
    Larry Page and Sergey Brin at Stanford
    University
   Launched in 1998
   By the end of 1998 it already
    had an index of about 60
    million pages
   Quickly gained popularity thanks
    to the PageRank algorithm
Google
   Today Google is the most popular web search
    engine in the world: 85% of the web search market
   Provides many other services:
         Gmail
         Google Maps
         Google+
         …
   Has its own OS – Android
   Provides a web browser – Google Chrome
   ...
Yandex

   Founded in 1997 by
    Arkady Volozh and Ilya Segalovich
   The first web search engine to provide
    morphological search
   The prototype of the Yandex search engine was a
    system for automated searching in the Bible
   The name stands for “Yet Another iNDEXer”
Yandex
   In 1998 Yandex launched
    contextual advertising
   In 2001 Yandex.Direct was launched – an
    automated, auction-based system for
    placing text-based advertising
   2005 – Ukrainian portal, www.yandex.ua
   2008 – Yandex Labs in the San Francisco Bay Area
   2010 – English version of the web search engine
   2011 – search engine and a range of other
    services in Turkey, at yandex.com.tr
Yandex today

   63% of the Russian web search market
   More than 3,500 employees
   24 offices in 8 countries
Conclusion

   Web search engines are an integral part of our
    life today
   They have come a long way to reach
    today's performance and power
   Their development is far from finished
   The main development trends are:
       Web search personalization
       Location-based search
       Vertical search
Your questions, please




Thank you for your time!





Tolmachev Alexander Web Search Engines

  • 1. Web search engines Alexander Tolmachev gr. #3057/2
  • 2. Contents  Introduction: what do web search engines mean for us today?  History of web search engines  How web search engines work  Most popular search engines  Conclusion: past, present and future of web search 2
  • 3. Contents ➔ Introduction: what do web search engines mean for us today?  History of web search engines  How web search engines work  Most popular search engines  Conclusion: past, present and future of web search 3
  • 4. The Web as a huge storage of information  A huge amount of information is contained in the Word Wide Web  And this amount is still growing day by day  We need to orient ourself in this enormous information space  Web search engines provide us fast search of information that we are interested in 4
  • 5. Web search engines in our life  We use web search engines every day for:  Searching texts, articles, books, news, etc.  Searching different media: music, videos, films, pictures, etc.  Searching goods  Searching web sites and web portals  Preparing lectures and presentations ☺  …  The verb “to google” is included in dictionaries  Web search engines have become an integral part of our life 5
  • 6. Contents ✔ Introduction: what do web search engines mean for us today? ➔ History of web search engines  How web search engines work  Most popular search engines  Conclusion: past, present and future of web search 6
  • 7. The very first search tools  1989–1991 – the invention of the World Wide Web by Sir Tim Berners-Lee in CERN  Archie (1990)  The first Internet search tool  Fetching and indexing files on FTP servers  Providing search for indexed files  Veronica and Jughead – similar to Archie search tools for Gopher protocol invented in 1991 7
  • 8. The first web search engines  W3Catalog (1993)  The first primitive search engine  Mirroring and integration of manually maintained catalogues  Still available: http://www.w3catalog.com/  World Wide Web Wanderer (1993)  The first web crawler  The first web index called Wandex  Aimed to count Web size, not to serve as a search tool 8
  • 9. The first web search engines  JumpStation (1993)  The first web search engine combining crawling, indexing and searching  A web form for search queries  No ranking, just listing search results  Excite (1994)  The first ranking system  WebCrawler (1994)  Indexing full text  The first widely known web search engine 9
  • 10. Web search evolution  1994–1997 – a number of similar web search engines:  Infoseek  OpenText  Magellan  Inktomi  Northern Light  AskJeeves  AltaVista 10
  • 11. Web search evolution  Yahoo! (1994)  Search in human edited hierarchical web directory  Manual solution of relevancy  Search by keywords as well as browsing full directory  Gained large popularity  Later in 2004 developed its own web search engine  One of the main stars in business world in 1990s 11
  • 12. Web search evolution  Google (1998)  The invention of Page Rank  Simple and clear interface instead of turning to a web portal  Yandex (1997)  Full-text search with Russian morphology support  Quickly gained large popularity in Russia 12
  • 13. Web search engines today  Powerful web search technologies  Maximal freshness of results  Variety of types of searchable documents  Intelligent algorithms of ranking  Media search:  Images  Music  Videos  … 13
  • 14. Web search engines today  Personalized search  Based on user's search history  Based on personal information from virtual social spaces  Location-based search  Vertical search  Image-based search  Audio-based search 14
  • 15. Contents ✔ Introduction: what do web search engines mean for us today? ✔ History of web search engines ➔ How web search engines work  Most popular search engines  Conclusion: past, present and future of web search 15
  • 16. Basic principles of web search  Create and sort a pool of data  Find the most appropriate information  Deliver this information 16
  • 17. Basic parts of a web search engine  A web spider/crawler/robot – a computer program which:  Continuously traverses web pages  Finds new or changed content  Stores visited pages in a corpus  Index – a database containing the crawling results  Search engine – a computer program which:  Identifies pages relevant to the search query  Retrieves these pages  Ranks them  User interface 17
  • 18. Web crawling  Web crawling aims to traverse web pages and to store their copies for further indexing  General web crawler algorithm:  Starts with a list of initial URLs, called the seeds  Visits these URLs  Retrieves the required information from each page  Identifies all the hyperlinks on the page  Adds these links to the queue of URLs, called the crawl frontier  Recursively visits URLs from the crawl frontier 18
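The crawler algorithm on this slide can be sketched as a breadth-first traversal. This is a minimal illustration, not a production crawler: the `TOY_WEB` dictionary, the `crawl` function and the `fetch_links` callback are hypothetical names, and an in-memory link table stands in for real HTTP downloads and HTML parsing. The slide says "recursively"; the sketch uses an explicit queue, which visits the same pages without deep recursion.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs linked from that page.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/", "a.example/"],
    "c.example/": [],
}


def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs.

    fetch_links(url) stands in for downloading a page and extracting
    its hyperlinks; it returns a list of URLs or raises KeyError.
    """
    frontier = deque(seeds)  # the crawl frontier: URLs waiting to be visited
    seen = set(seeds)        # URLs already queued, so no page is queued twice
    visited = []             # pages in the order they were actually visited
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        try:
            links = fetch_links(url)
        except KeyError:
            continue         # unknown page: skip it
        visited.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["a.example/"], TOY_WEB.__getitem__)
print(order)
```

The `seen` set is what keeps the crawler from looping forever on pages that link back to each other.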
  • 20. Crawling policies  A selection policy  Focused crawling  Restricting followed links  URL normalization  Path-ascending crawling  A re-visit policy  Uniform policy  Proportional policy  A politeness policy  A parallelization policy 20
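Two of the selection-policy items above, URL normalization and path-ascending crawling, are easy to illustrate. The sketch below is one possible reading of them, built on the standard `urllib.parse` module; the function names `normalize_url` and `ascending_urls` and the exact normalization rules (lowercase host, drop default port and fragment) are assumptions, not something the slides specify.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return a canonical form of url so that duplicates compare equal.

    Assumed rules: lowercase scheme and host, drop the default port,
    use "/" for an empty path, drop the fragment. Userinfo is dropped too.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def ascending_urls(url):
    """List every parent directory of url, from deepest up to the root."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    result = []
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth])
        if depth:
            path += "/"
        result.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return result


print(normalize_url("HTTP://Example.COM:80/a#top"))
print(ascending_urls("http://example.com/a/b/page.html"))
```

Normalizing before adding a URL to the crawl frontier avoids fetching the same page under several spellings.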
  • 21. Indexing  The purpose of indexing is to find the documents in the corpus that are relevant to a search query quickly.  For example, with 10,000 documents:  A query is answered within milliseconds with the help of an index  A sequential scan could take hours  Meta search engines reuse the indices of other services and do not store a local index  E.g. vertical search can use the indices of vertical services 21
  • 22. Inverted index  Stores, for each word, a list of the documents containing it  Provides direct access to the documents associated with each word in the search query  Commonly used by web search engines  Inconvenient to update 22
  • 23. Forward index  Stores a list of words for each document  It is more convenient to store the words of a document immediately while parsing it  Enables asynchronous processing – much easier to update than an inverted index  Is stored in order to be converted into an inverted index later 23
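A minimal sketch of an inverted index, assuming whitespace tokenization and lowercase matching (real engines use far richer text processing). The sample documents and the names `build_inverted_index` and `search` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical corpus: document id -> text.
DOCS = {
    1: "web search engines crawl the web",
    2: "an index makes search fast",
    3: "engines rank pages",
}


def build_inverted_index(docs):
    """Map each word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


index = build_inverted_index(DOCS)
print(search(index, "search engines"))
```

Each query word is a single dictionary lookup, which is why the index answers queries without scanning the documents themselves.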
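The relationship between the two indices can be sketched as follows: the forward index is filled one document at a time during parsing, and the inverted index is derived from it afterwards. The function names and sample documents are hypothetical, and tokenization is again plain whitespace splitting.

```python
def build_forward_index(docs):
    """Map each document id to the list of words it contains."""
    return {doc_id: text.lower().split() for doc_id, text in docs.items()}


def invert(forward_index):
    """Convert a forward index into an inverted index (word -> document ids)."""
    inverted = {}
    for doc_id, words in forward_index.items():
        for word in words:
            inverted.setdefault(word, set()).add(doc_id)
    return inverted


forward = build_forward_index({"d1": "fast web search", "d2": "web index"})
inverted = invert(forward)
print(inverted["web"])

# Updating one document touches a single forward-index entry;
# the inverted index is then rebuilt from the forward index.
forward["d2"] = ["search", "index"]
inverted_after_update = invert(forward)
print(inverted_after_update["web"])
```

In the forward index an update is one assignment, whereas in the inverted index the same change would have to be applied to the entry of every affected word.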
  • 24. Ranking  Ranking is the arrangement of web search results in order of relevance  Usually based on statistical methods  Frequency of keywords in a particular document  Rating page popularity and authority  Advanced search engines also use intelligent ranking algorithms 24
  • 25. Google PageRank  PageRank was invented in 1998 by Larry Page and Sergey Brin at Stanford University  It aims to rate the authority of a web page relative to other web pages  Basic principles:  A hyperlink to a page counts as a vote of support  A page with a large number of incoming links has high authority  A hyperlink coming from an authoritative web page carries more weight  PR(p) is the probability that a person randomly clicking on links will arrive at page p 25
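The simplest statistical method named here, keyword frequency, might look like the sketch below. The scoring rule (query-word occurrences divided by document length) is an assumption chosen for brevity, not the formula of any particular engine, and the sample documents are invented.

```python
from collections import Counter


def rank_by_term_frequency(docs, query):
    """Order document ids by the relative frequency of the query words.

    Documents that contain none of the query words are left out.
    Ties are broken by document id to keep the order deterministic.
    """
    query_words = query.lower().split()
    scores = {}
    for doc_id, text in docs.items():
        words = text.lower().split()
        counts = Counter(words)
        hits = sum(counts[w] for w in query_words)
        if hits:
            scores[doc_id] = hits / len(words)
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    "a": "search engines search the web",
    "b": "the web is large",
    "c": "cooking recipes",
}
ranking = rank_by_term_frequency(docs, "search web")
print(ranking)
```

Dividing by the document length keeps long documents from winning merely because they contain more words.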
  • 26. Google PageRank  [Figure: example graph of four pages A, B, C, D with their PageRank values in three panels: 0.25, 0.25, 0.25, 0.25; then 1/2, 1/6, 1/6, 1/6; then 6/17, 2/17, 3/17, 6/17] 26
  • 27. Google PageRank  So, PageRank of page A: $PR(A) = \sum_{v \in B_A} \frac{PR(v)}{L(v)}$  In the general case, the PageRank value for any page u: $PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}$, where $B_u$ is the set of all pages linking to page u and $L(v)$ is the number of links from page v. 27
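One update step of this rule can be worked through with exact fractions. The four-page graph in `LINKS` is a hypothetical example (it is not the graph from the slides), and the sketch assumes every page has at least one outgoing link. Each page v passes PR(v)/L(v) to every page it links to.

```python
from fractions import Fraction

# Hypothetical link graph: page -> pages it links to.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}


def pagerank_step(links, ranks):
    """Apply the update PR(u) = sum of PR(v) / L(v) over pages v linking to u."""
    new_ranks = {page: Fraction(0) for page in links}
    for v, targets in links.items():
        share = ranks[v] / len(targets)  # PR(v) / L(v)
        for u in targets:
            new_ranks[u] += share
    return new_ranks


# Start from a uniform distribution: 1/4 for each of the four pages.
initial = {page: Fraction(1, len(LINKS)) for page in LINKS}
after_one = pagerank_step(LINKS, initial)
for page, rank in after_one.items():
    print(page, rank)
```

Page C collects votes from A, B and D, so after one step it already holds most of the total rank, while D, which nobody links to, drops to zero.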
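  • 28. Google PageRank  Spider traps: A B C  Damping factor  d – the probability that the random surfer continues the traversal  (1-d) – the probability of jumping to a random site  The resulting formula: $PR(u) = \frac{1 - d}{N} + d \sum_{v \in B_u} \frac{PR(v)}{L(v)}$, where N is the total number of pages 28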
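A minimal sketch of PageRank with a damping factor, computed by repeated updates. Several details are assumptions rather than content of the slides: the random jump lands on any of the N pages with equal probability, so each page receives (1 - d) / N; d defaults to the commonly quoted 0.85; a page with no outgoing links spreads its rank evenly over all pages; and the `LINKS` graph is hypothetical.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}


def pagerank(links, d=0.85, iterations=50):
    """Iterate the damped PageRank update a fixed number of times."""
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets the random-jump share (1 - d) / N.
        new_ranks = {p: (1.0 - d) / n for p in pages}
        for v, targets in links.items():
            if targets:
                share = d * ranks[v] / len(targets)
                for u in targets:
                    new_ranks[u] += share
            else:
                # Assumption: a page without links passes rank to all pages.
                share = d * ranks[v] / n
                for u in pages:
                    new_ranks[u] += share
        ranks = new_ranks
    return ranks


ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Because of the random jump, a page with no incoming links, such as D here, keeps a small non-zero rank, and rank cannot accumulate forever inside a spider trap.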
  • 29. Web Search Engine Architecture 29
  • 30. Contents ✔ Introduction: what do web search engines mean for us today? ✔ History of web search engines ✔ How web search engines work ➔ Most popular search engines  Conclusion: past, present and future of web search 30
  • 31. Google  Was started in 1996 as a research project of Larry Page and Sergey Brin at Stanford University  Was launched in 1998  By the end of 1998 already had an index of about 60 million pages  Quickly gained popularity thanks to the PageRank algorithm 31
  • 32. Google  Today Google is the most popular web search engine in the world: 85% of the web search market  Provides many other services:  Gmail  Google Maps  Google+  …  Has its own OS – Android  Provides a web browser – Google Chrome  ... 32
  • 33. Yandex  Was founded in 1997 by Arkady Volozh and Ilya Segalovich  The first web search engine providing morphological search  The prototype of the Yandex search engine was a system for automated searching of the Bible  The name stands for “Yet Another iNDEXer” 33
  • 34. Yandex  In 1998 Yandex launched contextual advertising  In 2001 Yandex.Direct was launched – an automated, auction-based system for placement of text-based advertising  2005 – Ukraine portal, www.yandex.ua  2008 – Yandex Labs in San Francisco Bay area  2010 – English version of web search engine  2011 – search engine and a range of other services in Turkey, at yandex.com.tr 34
  • 35. Yandex 35
  • 36. Yandex today  63% of Russian web search market  More than 3500 employees  24 offices in 8 countries 36
  • 37. Contents ✔ Introduction: what do web search engines mean for us today? ✔ History of web search engines ✔ How web search engines work ✔ Most popular search engines ➔ Conclusion: past, present and future of web search 37
  • 38. Conclusion  Web search engines are an integral part of our life today  They have come a long way to reach today's performance and power  Their development is far from finished  The main development trends are:  Web search personalization  Location-based search  Vertical search 38
  • 40. Thank you for your time! 40