“The Anatomy of a Large-Scale
Hypertextual Web Search Engine”
by Sergey Brin and Lawrence Page
Papers We Love Bucharest
Eduard Mucilianu, Stefan Alexandru Adam
31st of August 2015
TechHub
Short History of Web Search
Altavista and Yahoo Directory in 1994
Google and MSN launch in 1998
• Goto.com's annual revenues were close to $1 billion;
it became Overture and was acquired by Yahoo in 2003
First pay-per-click business model for advertising
Used first-price auctions for keywords; "casino" was an expensive keyword
• Google AdWords uses generalized second-price auctions, which do not
necessarily promote truthfulness
Prior Related Work
The primary benchmark for Information Retrieval, the Text Retrieval
Conference (TREC), used a fairly small, well-controlled collection for
its benchmarks
"Some argue that on the web, users should specify more accurately
what they want and add more words to their query"
Differences Between the Web and Well-Controlled Collections
External meta information includes things like reputation of the source,
update frequency, quality, popularity or usage, and citations
PageRank Scoring
Propagating weights through the link structure of the web
Aided by anchor text to improve search results, especially for non-text
information
The PageRank of a page is the long-term visit rate; it has its origins in Citation
Analysis
In a steady state, what is the probability of being on that page?
We need a process that allows us to get into that steady state
It is query-independent! Just a way to measure the importance of a page
Markov Chains
A set of n states and an n × n transition probability matrix
over out-links
A good abstraction for random walks: the way a web
surfer follows links while browsing
For PageRank, a state is actually a web page
An intermediate process for reaching the steady-
state probability distribution
     dY   dA   dM
dY   1/2  1/2   0
dA   1/2   0   1/2
dM    0    1    0
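A quick sketch of this chain as code (NumPy; the matrix values are the example from the table above, everything else is illustrative):

```python
import numpy as np

# Row-stochastic out-link matrix from the slide: row i holds the probabilities
# of following each out-link from page i (order: Y, A, M).
P = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])

x = np.array([1/3, 1/3, 1/3])   # start the surfer anywhere with equal probability
x_next = P.T @ x                # one step of the random walk: x_{k+1} = P^T x_k
print(x_next)                   # distribution over the pages after one click
```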
Markov Property
Memorylessness of a stochastic process
The probability of the next state depends ONLY on the current state
Even if you came from Google.com, the probability that you will randomly jump
to Yahoo.com is the same
Dead-ends and Teleporting
Surfer can get stuck whilst browsing in
● a dead-end
or
● spider trap (group of pages that together form a dead-end)
The user may choose to randomly switch to an out-of-context page
At a non-dead-end page, jump to a random page with a chosen teleport probability
of around 10%-20%
Use the remaining 80%-90% for out-links
Power Method
x_{k+1} = P^T x_k = (P^T)^k x_1
x is the probability vector; we want a = P^T a
The stable probability vector a is the PageRank vector

With teleporting, the damped matrix is 0.8 P^T + 0.2 (1/3) J:

      | 1/2  1/2   0 |         | 1/3  1/3  1/3 |   |  7/15  7/15   1/15 |
0.8 * | 1/2   0    0 |  + 0.2 * | 1/3  1/3  1/3 | = |  7/15  1/15   1/15 |
      |  0   1/2   1 |         | 1/3  1/3  1/3 |   |  1/15  7/15  13/15 |
Stable state
Start with uniformly distributed probability vector, thus giving each page equal
chances
Settle on a when |x_{k+1} - x_k| is below the desired threshold

            x1     x2 = P^T x1   x3 = P^T x2   x4 = P^T x3   ...
Yahoo       1/3    1/3           7/25          97/375
Amazon      1/3    1/5           1/5           67/375
Microsoft   1/3    7/15          13/25         211/375

PR(Yahoo)     7/33
PR(Amazon)    5/33
PR(Microsoft) 7/11
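The two slides above condense into a short power-iteration sketch (NumPy; the 0.8/0.2 damping and the matrix are the slide's example, the 1e-9 threshold is an arbitrary choice):

```python
import numpy as np

# P^T for the Power Method example (pages: Yahoo, Amazon, Microsoft).
PT = np.array([[0.5, 0.5, 0.0],
               [0.5, 0.0, 0.0],
               [0.0, 0.5, 1.0]])

n = PT.shape[0]
G = 0.8 * PT + 0.2 * np.full((n, n), 1.0 / n)   # damped matrix from the slide

x = np.full(n, 1.0 / n)                         # uniform start: equal chances for each page
while True:
    x_next = G @ x
    if np.abs(x_next - x).sum() < 1e-9:         # settle when the change is below the threshold
        break
    x = x_next

print(x)   # ~ [7/33, 5/33, 7/11] = PR(Yahoo), PR(Amazon), PR(Microsoft)
```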
Centrality measures
• Degree Centrality
• Measures the immediate risk of catching whatever flows through the network
• Closeness Centrality
• Measures how close a node is to all the other nodes
• Betweenness Centrality
• Measures the number of times a node acts as a bridge on shortest paths
• Eigenvector Centrality
• Measures the influence or prestige of a node; depends on the neighbors'
centrality
• PageRank is a variant of it
[Figure: example graph with nodes 1-10]
Other Centrality measures
• Degree Centrality
• C_D(x) = deg(x), where deg(x) is the number of adjacent nodes
• Closeness Centrality
• C_C(x) = 1 / Σ_y d(y, x), where d(y, x) is the distance between x and y
• Betweenness Centrality
• C_B(x) = Σ_{s ≠ x ≠ t} σ_st(x) / σ_st, where σ_st is the total number of shortest
paths between s and t and σ_st(x) is the number of shortest paths that pass through x
[Figure: the same example graph with nodes 1-10]
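For reference, all of these measures are one call each in the networkx library; the example graph here is arbitrary, not the one pictured on the slide:

```python
import networkx as nx

G = nx.karate_club_graph()               # any example graph will do

print(nx.degree_centrality(G))           # C_D: number of neighbors (normalized)
print(nx.closeness_centrality(G))        # C_C: based on distances to all other nodes
print(nx.betweenness_centrality(G))      # C_B: share of shortest paths passing through a node
print(nx.eigenvector_centrality(G))      # influence depending on the neighbors' centrality
print(nx.pagerank(G, alpha=0.85))        # PageRank as the damped variant
```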
Eigenvector Centrality
• Solve A · v = λ · v, where A is the adjacency matrix. The solution is
the eigenvector for the largest eigenvalue, or for eigenvalue 1
(in the stochastic variant)
• Consider the stochastic version of A and an equal (uniform) distribution
vector v. The stationary probability is v* = A^n · v for large n
• The Google PageRank algorithm builds on this (and overcomes disconnected networks)
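A small NumPy sketch of the non-stochastic variant, on a hypothetical 4-node undirected graph: take the eigenvector of the largest eigenvalue of the adjacency matrix.

```python
import numpy as np

# Adjacency matrix of a small undirected example graph (hypothetical).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

eigenvalues, eigenvectors = np.linalg.eigh(A)   # A is symmetric; eigenvalues in ascending order
v = np.abs(eigenvectors[:, -1])                 # eigenvector of the largest eigenvalue
print(v / v.sum())                              # eigenvector centrality scores
```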
Design requirements
• Ability to scale
• Improved search quality
• Strong performance
• Efficiency
• Modular architecture
• Define a ranking system
Design strategies
• Most of the Google engine is written
in C and C++ and runs on
Linux/Solaris
• Avoid disk seeks whenever
possible
Search Text Engine - Web Object Model
• URL - identifies a web resource
• Document - the HTML content of a web resource
• DocID - uniquely identifies a document
• Word
• a unit of meaning, part of the Lexicon
• appears in a document with different attributes (position, font size, link, color)
• WordID - uniquely identifies a word
• Lexicon - the totality of lexemes
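A rough sketch of this object model as Python dataclasses (field names are illustrative, not the paper's on-disk structures):

```python
from dataclasses import dataclass, field

@dataclass
class Hit:                     # one occurrence of a word in a document
    position: int
    font_size: int
    capitalized: bool
    in_anchor: bool            # illustrative attribute, standing in for link/type info

@dataclass
class Document:
    doc_id: int                # DocID: uniquely identifies the document
    url: str
    html: str                  # the HTML content

@dataclass
class Word:
    word_id: int               # WordID: uniquely identifies the word in the Lexicon
    text: str
    hits: list[Hit] = field(default_factory=list)
```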
Search Text Engine - Base Model
Information Retrieval Process: Crawling → Indexing → Searching
(key words in, URL results out)
Search Text Engine - Model
[Figure: the full engine architecture - key words in, URL results out]
Modules Responsibilities
• URLServer sends URLs to Crawlers
• Crawler fetches data from a URL
• Store Server compresses data and saves it to the Repository storage
• Indexer
• converts the documents from the Repository into words and Hits. Hits are then
stored in barrels and new words are stored in the Lexicon
• extracts links and saves them into the Anchors storage
• indexes documents and saves the indexes into the Document Index
storage
• URL Resolver converts relative links to absolute links and saves
them to the Links storage
• PageRank computes the page rank
• Sorter sorts hits
• Searcher runs the search
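A minimal sketch of how these modules hand data to one another (all function names here are hypothetical stubs, not the paper's code):

```python
def crawl(url: str) -> str: ...                  # Crawler: fetch the page
def store(doc_id: int, html: str) -> None: ...   # Store Server: compress and write to the Repository
def index(doc_id: int, html: str) -> None: ...   # Indexer: hits to barrels, words to Lexicon, links to Anchors
def resolve_links() -> None: ...                 # URL Resolver: relative -> absolute, fill Links storage
def compute_pagerank() -> None: ...              # PageRank over the link graph
def sort_barrels() -> None: ...                  # Sorter: forward barrels -> inverted barrels

def pipeline(urls: list[str]) -> None:
    for doc_id, url in enumerate(urls):
        html = crawl(url)
        store(doc_id, html)
        index(doc_id, html)
    resolve_links()
    compute_pagerank()
    sort_barrels()
```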
Repository
• Contains the full HTML of every page
• Records are stored one after another
• Documents are compressed using zlib
• Every record contains
• DocID - document identifier
• Length
• URL
• Document (html content)
• approx. 53 GB (stored in large virtual files spanning multiple file systems)
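A sketch of packing one such record in Python (zlib as in the paper; the exact header field widths here are an assumption, not the paper's layout):

```python
import struct
import zlib

def pack_record(doc_id: int, url: str, html: str) -> bytes:
    """One Repository record: DocID, lengths, URL, zlib-compressed HTML."""
    compressed = zlib.compress(html.encode("utf-8"))
    url_bytes = url.encode("utf-8")
    header = struct.pack("<IHI", doc_id, len(url_bytes), len(compressed))  # assumed field widths
    return header + url_bytes + compressed

record = pack_record(42, "http://example.com/", "<html>hello</html>")
```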
Document Index
• It is an ISAM (Indexed Sequential Access Method) index ordered by DocID
• Each record contains
• a pointer to the document record in the Repository
• the document checksum
• Contains a file used to convert URLs into
DocIDs
• The file is ordered by checksum (an int)
• Binary search is applied to get the DocID
• Supports batch updates
• The URL Resolver uses it to convert URLs to DocIDs
• 9.7 GB
[Figure: a URL (www.blabla) is hashed to its checksum (145) and looked up in the checksum-ordered file]

Checksum   DocID
43         32
54         345
123        3245
...        ...
654        12312
5433       5325345
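The lookup itself boils down to a binary search over the checksum-sorted file; a sketch (the checksum values mirror the table above, and crc32 merely stands in for whatever checksum function was actually used):

```python
import bisect
import zlib

# The file modeled as two parallel lists sorted by checksum (illustrative values).
checksums = [43, 54, 123, 654, 5433]
doc_ids   = [32, 345, 3245, 12312, 5325345]

def url_to_docid(url: str):
    """Hash the URL to a checksum, then binary-search the sorted checksum list."""
    checksum = zlib.crc32(url.encode("utf-8"))       # stand-in for the paper's checksum
    i = bisect.bisect_left(checksums, checksum)
    if i < len(checksums) and checksums[i] == checksum:
        return doc_ids[i]
    return None                                      # URL not in the index
```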
Lexicon
• The inventory of lexemes
• Lexeme examples: "Eat", "Ate", "Eaten" represent a single lexeme;
"Flesh and blood" = an expression = a single lexeme; "Flesh" = a single lexeme
• Contains the full list of words (word ~ lexeme)
• Fits in memory (293 MB)
• 14 million records
• Records are separated by null
• Each record contains
• WordID
• A pointer to the wordID in the Inverted Index
! Stemming was not supported
Hit List
• A list of occurrences of a particular word
• Represents the most valuable information and is the result of a long
and expensive chain of processing
• Hits are stored in the Forward Index and the Inverted Index
• Each hit is characterized by:
• Fancy/Plain
• Font Size (relative to the document)
• Capitalization
• Position
• Each hit is stored in two bytes in an encoded fashion
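A sketch of that two-byte encoding for a plain hit, following the paper's description (1 capitalization bit, 3 bits of relative font size, 12 bits of position); the bit order chosen here is an assumption:

```python
def pack_plain_hit(capitalized: bool, font_size: int, position: int) -> int:
    """Pack a plain hit into 16 bits: capitalization, relative font size, word position."""
    assert 0 <= font_size < 8 and 0 <= position < 4096
    return (int(capitalized) << 15) | (font_size << 12) | position

hit = pack_plain_hit(capitalized=True, font_size=3, position=117)   # fits in two bytes
```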
Forward Index
• Stored in 64 barrels
• If a document contains words that fall into a particular barrel,
the docID is recorded into the barrel, followed by a list of
wordID’s with hit lists which correspond to those words
• Each wordID stored as a relative difference from the minimum
wordID that falls into the barrel the wordID is in
• The Forward Index represents a transitory storage state
! Querying the forward index would require sequential iteration
through each document and each word to verify a matching
document
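A sketch of one forward-barrel entry with the relative wordID encoding (data structures are illustrative; the hit values could be the packed two-byte hits from the earlier sketch):

```python
def forward_barrel_record(doc_id, word_hits, barrel_min_wordid):
    """One forward-barrel entry: the docID followed by (wordID delta, hit list) pairs,
    each wordID stored relative to the smallest wordID assigned to this barrel."""
    return (doc_id, [(word_id - barrel_min_wordid, hits)
                     for word_id, hits in sorted(word_hits.items())])

# words 70001 and 70042 of a barrel starting at wordID 70000, with packed hits
entry = forward_barrel_record(42, {70001: [0x8003], 70042: [0x3010, 0x3055]}, 70000)
```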
Inverted Index
• Represents the data in the fully processed state
• Consists of the same barrels as the Forward Index, except that they
have been processed by the Sorter
• The Sorter takes each forward barrel and sorts it by wordID
• Each wordID entry contains a list of documents in which it appears and the
corresponding hit lists
• Two sets of inverted barrels
• One set which contains fancy hits (where the initial search is done)
• Another set which contains plain hits
• Approx. 40 GB
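Continuing the previous sketch, the Sorter's job amounts to regrouping those per-document entries by wordID (an in-memory toy version; the real barrels are sorted externally):

```python
from collections import defaultdict

def invert_barrel(forward_barrel, barrel_min_wordid):
    """Turn a forward barrel (per-document postings) into an inverted barrel keyed by
    wordID, each wordID getting the (docID, hit list) pairs of the documents it appears in."""
    inverted = defaultdict(list)
    for doc_id, postings in forward_barrel:
        for word_id_delta, hits in postings:
            inverted[barrel_min_wordid + word_id_delta].append((doc_id, hits))
    return dict(sorted(inverted.items()))            # entries ordered by wordID
```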
Crawling the web
• Crawling is done by several crawlers in a distributed fashion
• The crawlers were implemented in Python
• Each crawler keeps about 300 open connections
• Keeps a DNS cache to avoid DNS lookups (requests to DNS servers to get
the IP address of a host)
• Speed: 100 web pages/second
• Cultural impact: people were not familiar with the robots exclusion
protocol
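The DNS cache is the kind of thing that fits in a few lines; a sketch (a plain in-process dict, not the paper's implementation):

```python
import socket

_dns_cache: dict[str, str] = {}

def resolve(host: str) -> str:
    """Resolve each host name only once; later lookups hit the local cache."""
    if host not in _dns_cache:
        _dns_cache[host] = socket.gethostbyname(host)
    return _dns_cache[host]
```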
Indexing the Web
• Parsing
• A complex operation which must handle a lot of possible errors
• The parser is based on the flex tool, available at http://flex.sourceforge.net/
• Using flex you can generate lexical analyzers written in C
• It performs well and is very robust
• Indexing documents into barrels
• The process in which the document content is converted into hit lists and then saved
into the barrels. It results in a Forward index
• This operation is handled by the Indexer. Multiple Indexers can run in parallel
• Sorting
• The operation in which Forward Barrels are sorted by wordID, building the
Inverted Index (short inverted barrels and long inverted barrels)
• Because the barrels don't fit into memory, they are split into baskets and then
sorted
Searching
• Involves 8 distinct steps
• Is focused on quality
• Is limited to 40k items to limit
the response time
• Has a complex ranking system
• Supports feedback
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the
short barrel for every word.
4. Scan through the doclists until there
is a document that matches all the
search terms.
5. Compute the rank of that document
for the query.
6. If we are in the short barrels and at
the end of any doclist, seek to the start
of the doclist in the full barrel for every
word and go to step 4.
7. If we are not at the end of any
doclist, go to step 4.
8. Sort the documents that have
matched by rank and return the top k.
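A compressed sketch of steps 2-8 (short/full barrel switching omitted; `lexicon`, `inverted_index` and `rank` are hypothetical inputs: word -> wordID, wordID -> doclist, and a scoring callback):

```python
def search(query_words, lexicon, inverted_index, rank, k=10):
    word_ids = [lexicon[w] for w in query_words]                  # step 2: words -> wordIDs
    doclists = [inverted_index[wid] for wid in word_ids]          # step 3: fetch each doclist
    matching = set.intersection(*(set(d) for d in doclists))      # step 4: docs containing all terms
    scored = [(rank(doc_id, word_ids), doc_id) for doc_id in matching]    # step 5: rank each match
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]     # step 8: top k by rank
```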
Searching - Ranking System
• The hits are classified by type (title, anchor, URL, plain text in
large font)
• Every type is weighted based on its importance
• The formula for ranking a word w inside a page is the following:
• r(w) = Σ_i weight_i · C_w(count_i(w)), where count_i(w) is the
number of appearances of the word w for type i, C_w is the
"count-weight" function, and weight_i is the weight for type i
• The total rank R(w) is a combination of PageRank and r(w)
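A toy version of r(w) (the weights and the tapering count-weight function below are made-up values; the paper does not give the exact numbers):

```python
# Illustrative type weights: a hit in the title or anchor counts more than plain text.
WEIGHTS = {"title": 10.0, "anchor": 8.0, "url": 6.0, "plain_large": 3.0, "plain": 1.0}

def count_weight(count: int, cap: int = 8) -> float:
    return float(min(count, cap))          # tapers off: counts above the cap add nothing

def r(counts_by_type: dict) -> float:
    """r(w) = sum_i weight_i * C_w(count_i(w)) from the slide."""
    return sum(WEIGHTS[t] * count_weight(c) for t, c in counts_by_type.items())

score = r({"title": 1, "anchor": 2, "plain": 40})
```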
System Performance
Editor's Notes
  1. A node's centrality depends on its friends' centrality.
  2. As stated in the paper (chapter 6), Google is designed to be a scalable search engine. The primary goal is to provide high-quality search results over a rapidly growing World Wide Web.
  3. What is amazing is the model's complexity: it is very modular, and PageRank is only a very small piece of the search engine. Also, before Google there were engines that were easy to fool. Some older search engines used text-based search only; for example, a nonsense HTML page containing the word "math" millions of times was displayed first, just because the engine counted only the number of appearances without taking the page itself into consideration.
  4. Contains all web content! It is like a backup of the entire web. Traditionally, many operating systems and their underlying file system implementations used 32-bit integers to represent file sizes and positions; consequently, no file could be larger than 2^32 − 1 bytes (4 GB − 1). In many implementations the problem was exacerbated by treating the sizes as signed numbers, which further lowered the limit to 2^31 − 1 bytes (2 GB − 1). Files that were too large for 32-bit operating systems to handle came to be known as large files. In that period the libraries were not prepared for large files (for example, fseek took an int32 offset). Data is stored in big files (virtual files spanning multiple file systems) and represents the raw data, the source input for the Indexer. zlib is based on the Deflate algorithm (Huffman coding); bzip uses the Burrows–Wheeler transform.
  5. ISAM (Indexed Sequential Access Method) indexes were invented by IBM. The URL is converted into an int (its checksum). Batch update: two ordered lists can be easily merged; merge algorithms rely on sorted input, as in MergeSort.
  6. This is a very important component. The Lexicon is not just a collection of words; it is a collection of words, compound words and expressions with a meaning. For accurate searching it is very important to apply the search based on the meaning of the phrase, not just on individual words.
  7. We have started from the URL; the raw data is the document content. URL Server -> Crawler -> Repository -> Indexer -> Lexicon -> Hit List. We will see that Google ranks a hit differently based on the font size, link and position. A fancy hit means a hit that occurred in a URL, anchor, title or meta tag.
  8. Important note – Querying the forward index would require sequential iteration through each document and to each word to verify a matching document
  9. We see that every wordID from the Lexicon contains a pointer to that wordID's entry in the Inverted Index.