Building Googlebot
Youngjin Kim
October 15, 2013
http://www.creditwritedowns.com/2011/07/european-monetary-union-titanic.html
From the web to your query
● Query processing
1. Look up keywords in the index => every relevant page
2. Rank pages and display the result
● Google's index of the web
keyword => { page1, page2, ... }
● Building the index requires processing the current version of
all of the pages on the web...
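The keyword => { page1, page2, ... } mapping above can be illustrated with a toy inverted index. This is only a sketch: the function names `build_index` and `lookup`, the whitespace tokenization, and the three sample pages are assumptions made for illustration, and ranking is left out entirely.

```python
from collections import defaultdict


def build_index(pages):
    """Map each keyword to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def lookup(index, query):
    """Return the pages that contain every keyword of the query (unranked)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample data, not taken from the talk.
pages = {
    "page1": "web crawler basics",
    "page2": "building a web index",
    "page3": "cooking basics",
}
index = build_index(pages)
web_pages = lookup(index, "web")
web_basics_pages = lookup(index, "web basics")
```

The index has to be rebuilt or updated whenever a page changes, which is why the deck turns next to the size of the web.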
All of the pages on the web!?!
60 Trillion Pages And Counting!
Our local copy of the web
● Crawling
○ Googlebot
● Storage
○ Google File System (GFS), BigTable
● Processing
○ MapReduce
● Data Centers
○ Job control, Fault-Tolerance, High-Speed Networking,
Power/Cooling, etc.
Finding every page with Googlebot
● Basic discovery crawl
1. Start with the set of known links
2. Crawl every link (pages change!)
3. Extract every new link, repeat
[Figure: crawl loop connecting Crawl Status, Crawl Pages, Web Page, and Extract Links]
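The three discovery steps can be sketched as a simple loop over a crawl-status table. This is a minimal illustration under stated assumptions: `discovery_crawl`, the `fetch_links` callback, and the in-memory `FAKE_WEB` graph are hypothetical stand-ins for real fetching and link extraction, and the periodic recrawl implied by "pages change!" is not modeled.

```python
from collections import deque


def discovery_crawl(seed_links, fetch_links):
    """Crawl known links, extract new links from each page, and repeat."""
    crawl_status = {}              # link -> True once the link has been crawled
    frontier = deque(seed_links)   # step 1: start with the set of known links
    while frontier:
        link = frontier.popleft()
        if crawl_status.get(link):
            continue               # already crawled
        outlinks = fetch_links(link)       # step 2: crawl the link
        crawl_status[link] = True
        for out in outlinks:               # step 3: extract every new link
            if out not in crawl_status:
                crawl_status[out] = False
                frontier.append(out)
    return crawl_status


# Hypothetical four-page web used instead of real HTTP requests.
FAKE_WEB = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": []}
status = discovery_crawl(["a"], lambda link: FAKE_WEB.get(link, []))
```

Starting from the single seed "a", the loop reaches all four pages of the toy graph.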
Key considerations in crawling
● Polite crawling
○ Do not overload websites and DNS (no DoS!)
○ Understand web serving infrastructure
● Prioritize among discovered links
○ Crawl is a giant queuing system
○ Predicting serving capacity
● Do not waste resources
○ Ignore spam/broken links
○ Skip links with duplicate content
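One way to picture "polite crawling" combined with prioritization is a priority queue that refuses to hand out two links for the same host within a minimum delay. The sketch below is an assumption-laden illustration, not a description of Googlebot: the class name `PoliteFrontier`, the fixed per-host delay, and the explicit `now` argument (used so the example is deterministic) are all hypothetical.

```python
import heapq
from urllib.parse import urlsplit


class PoliteFrontier:
    """Priority queue of links that enforces a minimum delay per host."""

    def __init__(self, min_delay_s=1.0):
        self.min_delay_s = min_delay_s
        self._heap = []       # (priority, sequence, link); lowest priority value first
        self._seq = 0         # tie-breaker that keeps insertion order
        self._next_ok = {}    # host -> earliest time of the next allowed fetch

    def add(self, link, priority):
        heapq.heappush(self._heap, (priority, self._seq, link))
        self._seq += 1

    def pop_ready(self, now):
        """Return the best link whose host may be fetched at `now`, else None."""
        deferred = []
        chosen = None
        while self._heap:
            item = heapq.heappop(self._heap)
            host = urlsplit(item[2]).netloc
            if self._next_ok.get(host, 0.0) <= now:
                self._next_ok[host] = now + self.min_delay_s
                chosen = item[2]
                break
            deferred.append(item)   # host is still cooling down
        for item in deferred:
            heapq.heappush(self._heap, item)
        return chosen


frontier = PoliteFrontier(min_delay_s=1.0)
frontier.add("http://foo.com/a", priority=0)
frontier.add("http://foo.com/b", priority=1)
frontier.add("http://bar.com/x", priority=2)
first = frontier.pop_ready(now=0.0)
second = frontier.pop_ready(now=0.0)
third = frontier.pop_ready(now=0.5)
fourth = frontier.pop_ready(now=1.0)
```

In this run the second request skips the higher-priority foo.com link because that host was just fetched, and the link only becomes available again once the delay has passed. A real crawler would presumably adapt the delay to each host's predicted serving capacity rather than use one constant.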
Mirrors
● Hosts with exactly the same content
deview.kr
www.deview.kr

● Paths within hosts with the same content
www.cs.unc.edu/Courses/comp426-f09/docs/tools/downloads/tomcat/jakarta-tomcat-4.1.29/webapps/tomcat-docs
www.cs.unc.edu/Courses/comp590-001-f08/docs/tools/downloads/tomcat/jakarta-tomcat-4.1.29/webapps/tomcat-docs
www.cs.unc.edu/Courses/comp590-001-f08/tools/downloads/tomcat/jakarta-tomcat-4.1.29/webapps/tomcat-docs
www.cs.unc.edu/Courses/jbs/tools/downloads/tomcat/jakarta-tomcat/4.1.29/webapps/tomcat-docs

● Unrestricted mirroring across hosts and paths
○ Distributed graph mining
Optimizing our crawling
● Efficient crawling requires duplicate handling
○ Predict whether a newly discovered link points to
duplicate content
○ Must happen before crawling
useful(link, status_table) => { yes, no }
Duplicates in Dynamic Pages
● Duplicates are most common in dynamic links
http://foo.com/forum/viewtopic.php?t=3808&sid=126bc5f2
http://foo.com/forum/viewtopic.php?t=3808&sid=d5b8483b
http://foo.com/forum/viewtopic.php?t=3808&sid=3b1a8e27
http://foo.com/forum/viewtopic.php?t=3808&sid=2a21f059
...

● Significance analysis
○ Parameter t is relevant
○ Parameter sid is irrelevant
● Duplicate prediction
http://foo.com/forum/viewtopic.php?t=3808&sid=ee5da24a

=> Same content
Equivalence rules and class names
● Equivalence rule for a cluster
○ Set of relevant parameters
○ Set of irrelevant parameters
● Equivalence class name
○ Remove irrelevant parameters
ECN(link1) = ECN(link2) => Same content!
○ For the previous example
ECN(http://foo.com/forum/viewtopic.php?t=3808&sid=ee5da24a) =
http://foo.com/forum/viewtopic.php?t=3808
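Computing an equivalence class name by removing irrelevant parameters can be sketched with the standard `urllib.parse` module. The function name `equivalence_class_name` and the representation of a rule as a plain set of irrelevant parameter names are assumptions for illustration; the links are the ones from the forum example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def equivalence_class_name(link, irrelevant_params):
    """Drop the irrelevant query parameters and return what is left of the link."""
    parts = urlsplit(link)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in irrelevant_params
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


irrelevant = {"sid"}  # rule for the viewtopic.php cluster in the example
ecn_a = equivalence_class_name(
    "http://foo.com/forum/viewtopic.php?t=3808&sid=ee5da24a", irrelevant)
ecn_b = equivalence_class_name(
    "http://foo.com/forum/viewtopic.php?t=3808&sid=126bc5f2", irrelevant)
```

Both links reduce to the same name, so under the rule they are predicted to have the same content.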
Modified crawl algorithm
● Representative table
○ Equivalence class name => representative link
● Given a new link
1. Identify cluster
2. Look up equivalence rule
3. Apply rule to determine equivalence class name
4. Look up table of representatives
5. Crawl link if no representative found
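The five steps can be sketched as a single decision function. Several details here are assumptions rather than anything stated in the deck: a cluster is taken to be host plus path, a rule is a set of irrelevant parameter names, a missing rule means no parameter is removed, and the names `cluster_of`, `ecn`, and `should_crawl` are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cluster_of(link):
    """Assumption: a cluster is identified by host plus path."""
    parts = urlsplit(link)
    return parts.netloc + parts.path


def ecn(link, irrelevant_params):
    """Equivalence class name: the link with irrelevant parameters removed."""
    parts = urlsplit(link)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in irrelevant_params
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def should_crawl(link, rules, representatives):
    """Return True (and register the link) if its class has no representative."""
    irrelevant = rules.get(cluster_of(link), set())   # steps 1 and 2
    name = ecn(link, irrelevant)                      # step 3
    if name in representatives:                       # step 4
        return False
    representatives[name] = link                      # step 5: crawl this link
    return True


rules = {"foo.com/forum/viewtopic.php": {"sid"}}
representatives = {}
decisions = [
    should_crawl(link, rules, representatives)
    for link in [
        "http://foo.com/forum/viewtopic.php?t=3808&sid=126bc5f2",
        "http://foo.com/forum/viewtopic.php?t=3808&sid=d5b8483b",
        "http://foo.com/forum/viewtopic.php?t=3809&sid=d5b8483b",
    ]
]
```

The second link is skipped because the first already represents topic 3808, while the third is crawled because it names a different topic.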
Equivalence rule generation
● Find every crawled link under a cluster
cluster = { link1 : content1, link2 : content2, ... }
● Study evidence
1. Insignificance analysis
2. Significance analysis
3. Parameter classification
4. Equivalence rule construction
rule(cluster) = {
param1 : RELEVANT,
param2 : IRRELEVANT,
param3 : CONFLICT,
...
}
1. Insignificance analysis
● Group links by content
content1 = { link11, link12, ... }
content2 = { link21, link22, ... }
...
● For each parameter
○ For each content group with this parameter
■ If parameter values are not the same, add the number
of links to the insignificance index
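The insignificance analysis can be sketched as follows. The function name `insignificance_index` and the content ids "A" and "B" are assumptions; the cluster reuses the foo.com/directory links from the deck's worked example. Where a content group contains links without the parameter, this sketch counts only the links that carry it, which is one possible reading of the slide.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def insignificance_index(cluster):
    """cluster maps link -> content id; returns parameter -> insignificance index."""
    # Group the parsed query parameters of each link by the content it serves.
    by_content = defaultdict(list)
    for link, content in cluster.items():
        by_content[content].append(dict(parse_qsl(urlsplit(link).query)))

    index = defaultdict(int)
    for param_dicts in by_content.values():
        for param in set().union(*param_dicts):
            values = [params[param] for params in param_dicts if param in params]
            # Same content despite different values: evidence of irrelevance.
            if len(set(values)) > 1:
                index[param] += len(values)
    return dict(index)


cluster = {
    "http://foo.com/directory?P=1&Q=3": "A",
    "http://foo.com/directory?P=2&Q=3": "A",
    "http://foo.com/directory?P=1&Q=2": "B",
    "http://foo.com/directory?P=2&Q=2": "B",
    "http://foo.com/directory?P=3&Q=2": "B",
    "http://foo.com/directory?P=4&Q=2": "B",
}
insignificance = insignificance_index(cluster)
```

For this cluster only P accumulates evidence of irrelevance, because Q never varies within a content group.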
2. Significance analysis
● For each parameter
○ Remove the parameter from every link
■ Group content by remainder link
remainder1 = { content11, content12, ... }
remainder2 = { content21, content22, ... }
...
■ Increment significance index by the number of unique
contents minus 1
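The significance analysis can be sketched in the same style. Again the function name `significance_index` and the content ids are assumptions, and the remainder link is represented here as a tuple of host, path, and the sorted remaining parameters rather than as a rebuilt URL string.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def significance_index(cluster):
    """cluster maps link -> content id; returns parameter -> significance index."""
    parsed = {}
    for link in cluster:
        parts = urlsplit(link)
        parsed[link] = (parts.netloc, parts.path, parse_qsl(parts.query))
    params = {key for _, _, pairs in parsed.values() for key, _ in pairs}

    index = {}
    for param in params:
        # Remove the parameter from every link and group contents by remainder.
        by_remainder = defaultdict(set)
        for link, (host, path, pairs) in parsed.items():
            rest = tuple(sorted(pair for pair in pairs if pair[0] != param))
            by_remainder[(host, path, rest)].add(cluster[link])
        # Each extra distinct content under one remainder is evidence of relevance.
        index[param] = sum(len(contents) - 1 for contents in by_remainder.values())
    return index


cluster = {
    "http://foo.com/directory?P=1&Q=3": "A",
    "http://foo.com/directory?P=2&Q=3": "A",
    "http://foo.com/directory?P=1&Q=2": "B",
    "http://foo.com/directory?P=2&Q=2": "B",
    "http://foo.com/directory?P=3&Q=2": "B",
    "http://foo.com/directory?P=4&Q=2": "B",
}
significance = significance_index(cluster)
```

Removing Q merges links with different content under P=1 and under P=2, so Q collects a significance index of 2, while removing P never merges different contents.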
3. Parameter classification
● For each parameter
○ Compute content relevance (or irrelevance) value
Content_Relevance = Significance_Index / (Significance_Index + Insignificance_Index)
Content_Irrelevance = Insignificance_Index / (Significance_Index + Insignificance_Index)

○ Sample criteria: 90/10 rule
■ If relevance > 90% => parameter is RELEVANT
■ If relevance < 10% => parameter is IRRELEVANT
■ Otherwise, parameter is CONFLICT
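The relevance formula and the sample 90/10 criteria can be sketched as a small classifier. The function name `classify`, the default thresholds expressed as fractions, and the choice to return CONFLICT when there is no evidence at all are assumptions. The index values for P and Q are the ones reported in the deck's foo.com/directory example; parameter R is hypothetical.

```python
def classify(significance, insignificance, high=0.9, low=0.1):
    """Classify a parameter from its significance and insignificance indices."""
    total = significance + insignificance
    if total == 0:
        return "CONFLICT"   # assumption: no evidence either way
    relevance = significance / total
    if relevance > high:
        return "RELEVANT"
    if relevance < low:
        return "IRRELEVANT"
    return "CONFLICT"


rule = {
    "P": classify(significance=0, insignificance=6),
    "Q": classify(significance=2, insignificance=0),
    "R": classify(significance=5, insignificance=5),  # hypothetical mixed evidence
}
```

The resulting dictionary has the same shape as the rule(cluster) structure shown on the equivalence rule generation slide.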
Example: P is content-irrelevant
● Cluster
○ Content A
http://foo.com/directory?P=1&Q=3
http://foo.com/directory?P=2&Q=3
○ Content B
http://foo.com/directory?P=1&Q=2
http://foo.com/directory?P=2&Q=2
http://foo.com/directory?P=3&Q=2
http://foo.com/directory?P=4&Q=2
● Insignificance analysis of P
○ Content A: 2 links, different Ps
○ Content B: 4 links, different Ps
○ P's insignificance index = 2 + 4 = 6
○ P's content-irrelevance value = 100%
● Significance analysis of P
○ Q=3: 2 links, Content A
○ Q=2: 4 links, Content B
○ P's significance index = 0
○ P's content-relevance value = 0%
Example: Q is content-relevant
● Cluster
○ Content A
http://foo.com/directory?P=1&Q=3
http://foo.com/directory?P=2&Q=3
○ Content B
http://foo.com/directory?P=1&Q=2
http://foo.com/directory?P=2&Q=2
http://foo.com/directory?P=3&Q=2
http://foo.com/directory?P=4&Q=2
● Insignificance analysis of Q
○ Content A: 2 links, same Q
○ Content B: 4 links, same Q
○ Q's insignificance index = 0
○ Q's content-irrelevance value = 0%
● Significance analysis of Q
○ P=1: 2 links, Content A&B
○ P=2: 2 links, Content A&B
○ Q's significance index = 1 + 1 = 2
○ Q's content-relevance value = 100%
Facing the Real World
● Limitations
○ Co-changing parameters
○ Noisy data
○ Parameters not used in the standard way
○ Need for continuous validation
● State-of-the-art
○ White-box vs black-box
● Search is not solved
○ Not even crawling is solved!
Defining duplicates
● Identical pages
● Identical visible content
● Essentially identical visible content
○ Ignore page generation time
○ Ignore breaking news side bar
○ etc.
● What is the right answer?
Two pages should be considered duplicates
if our users would consider them duplicates
● How to translate this notion into a checksum?
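The deck leaves the checksum question open. As one naive illustration only, and not a description of what Google does, a checksum over "essentially identical visible content" could strip markup, drop a known volatile fragment, normalize case and whitespace, and hash the result. The function name `content_checksum`, the regular expressions, and the "page generated in ..." pattern are all hypothetical.

```python
import hashlib
import re


def content_checksum(html):
    """Hash a rough approximation of the visible text of a page."""
    # Drop script and style blocks, then every remaining tag.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Hypothetical volatile fragment: a page generation time footer.
    text = re.sub(r"(?i)page generated in [\d.]+ ?s(econds)?", " ", text)
    # Normalize case and whitespace before hashing.
    text = " ".join(text.lower().split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


page_a = "<html><body><p>Hello  World</p><div>Page generated in 0.12 seconds</div></body></html>"
page_b = "<html><body><P>hello world</P><div>Page generated in 0.48 seconds</div></body></html>"
page_c = "<html><body><p>Goodbye</p></body></html>"
same = content_checksum(page_a) == content_checksum(page_b)
different = content_checksum(page_a) != content_checksum(page_c)
```

An exact hash like this treats any remaining difference, such as a changed sidebar headline, as a new page, which is part of why the question on the slide is hard.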
Q&A
Thank You!

More Related Content

Similar to 212 building googlebot - deview - google drive

Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame APIdatamantra
 
MySQL Query Optimisation 101
MySQL Query Optimisation 101MySQL Query Optimisation 101
MySQL Query Optimisation 101Federico Razzoli
 
MySQL Optimizer Overview
MySQL Optimizer OverviewMySQL Optimizer Overview
MySQL Optimizer OverviewOlav Sandstå
 
MySQL Optimizer Overview
MySQL Optimizer OverviewMySQL Optimizer Overview
MySQL Optimizer OverviewOlav Sandstå
 
EuroPython 2013 - Python3 TurboGears Training
EuroPython 2013 - Python3 TurboGears TrainingEuroPython 2013 - Python3 TurboGears Training
EuroPython 2013 - Python3 TurboGears TrainingAlessandro Molina
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Graph processing at scale using spark &amp; graph frames
Graph processing at scale using spark &amp; graph framesGraph processing at scale using spark &amp; graph frames
Graph processing at scale using spark &amp; graph framesRon Barabash
 
Journey through high performance django application
Journey through high performance django applicationJourney through high performance django application
Journey through high performance django applicationbangaloredjangousergroup
 
Query optimization in Apache Tajo
Query optimization in Apache TajoQuery optimization in Apache Tajo
Query optimization in Apache TajoJihoon Son
 
Tahoe Dreamin 2018: It simply works... until it breaks!
Tahoe Dreamin 2018: It simply works... until it breaks!Tahoe Dreamin 2018: It simply works... until it breaks!
Tahoe Dreamin 2018: It simply works... until it breaks!Daniel Stange
 
Lead Score Web Visitors For KILLER Remarketing, Upsell and Exit Intent Strate...
Lead Score Web Visitors For KILLER Remarketing, Upsell and Exit Intent Strate...Lead Score Web Visitors For KILLER Remarketing, Upsell and Exit Intent Strate...
Lead Score Web Visitors For KILLER Remarketing, Upsell and Exit Intent Strate...Hanapin Marketing
 
Most Advanced GTM Deployment. Ever!
Most Advanced GTM Deployment. Ever!Most Advanced GTM Deployment. Ever!
Most Advanced GTM Deployment. Ever!Phil Pearce
 
Data Analysis with TensorFlow in PostgreSQL
Data Analysis with TensorFlow in PostgreSQLData Analysis with TensorFlow in PostgreSQL
Data Analysis with TensorFlow in PostgreSQLEDB
 
How MySQL can boost (or kill) your application v2
How MySQL can boost (or kill) your application v2How MySQL can boost (or kill) your application v2
How MySQL can boost (or kill) your application v2Federico Razzoli
 
Salesforce Apex Hours : How Lightning Platform Query Optimizer works for LDV
Salesforce Apex Hours : How Lightning Platform Query Optimizer works for LDVSalesforce Apex Hours : How Lightning Platform Query Optimizer works for LDV
Salesforce Apex Hours : How Lightning Platform Query Optimizer works for LDVAmit Chaudhary
 
What’s New in Imply 3.3 & Apache Druid 0.18
What’s New in Imply 3.3 & Apache Druid 0.18What’s New in Imply 3.3 & Apache Druid 0.18
What’s New in Imply 3.3 & Apache Druid 0.18Imply
 
Grokking TechTalk #20: PostgreSQL Internals 101
Grokking TechTalk #20: PostgreSQL Internals 101Grokking TechTalk #20: PostgreSQL Internals 101
Grokking TechTalk #20: PostgreSQL Internals 101Grokking VN
 

Similar to 212 building googlebot - deview - google drive (20)

Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
 
Spark Meetup
Spark MeetupSpark Meetup
Spark Meetup
 
MySQL Query Optimisation 101
MySQL Query Optimisation 101MySQL Query Optimisation 101
MySQL Query Optimisation 101
 
MySQL Optimizer Overview
MySQL Optimizer OverviewMySQL Optimizer Overview
MySQL Optimizer Overview
 
MySQL Optimizer Overview
MySQL Optimizer OverviewMySQL Optimizer Overview
MySQL Optimizer Overview
 
EuroPython 2013 - Python3 TurboGears Training
EuroPython 2013 - Python3 TurboGears TrainingEuroPython 2013 - Python3 TurboGears Training
EuroPython 2013 - Python3 TurboGears Training
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Cloud Computing Project
Cloud Computing ProjectCloud Computing Project
Cloud Computing Project
 
Graph processing at scale using spark &amp; graph frames
Graph processing at scale using spark &amp; graph framesGraph processing at scale using spark &amp; graph frames
Graph processing at scale using spark &amp; graph frames
 
Journey through high performance django application
Journey through high performance django applicationJourney through high performance django application
Journey through high performance django application
 
Query optimization in Apache Tajo
Query optimization in Apache TajoQuery optimization in Apache Tajo
Query optimization in Apache Tajo
 
Tahoe Dreamin 2018: It simply works... until it breaks!
Tahoe Dreamin 2018: It simply works... until it breaks!Tahoe Dreamin 2018: It simply works... until it breaks!
Tahoe Dreamin 2018: It simply works... until it breaks!
 
Lead Score Web Visitors For KILLER Remarketing, Upsell and Exit Intent Strate...
Lead Score Web Visitors For KILLER Remarketing, Upsell and Exit Intent Strate...Lead Score Web Visitors For KILLER Remarketing, Upsell and Exit Intent Strate...
Lead Score Web Visitors For KILLER Remarketing, Upsell and Exit Intent Strate...
 
Most Advanced GTM Deployment. Ever!
Most Advanced GTM Deployment. Ever!Most Advanced GTM Deployment. Ever!
Most Advanced GTM Deployment. Ever!
 
Data Analysis with TensorFlow in PostgreSQL
Data Analysis with TensorFlow in PostgreSQLData Analysis with TensorFlow in PostgreSQL
Data Analysis with TensorFlow in PostgreSQL
 
How MySQL can boost (or kill) your application v2
How MySQL can boost (or kill) your application v2How MySQL can boost (or kill) your application v2
How MySQL can boost (or kill) your application v2
 
Salesforce Apex Hours : How Lightning Platform Query Optimizer works for LDV
Salesforce Apex Hours : How Lightning Platform Query Optimizer works for LDVSalesforce Apex Hours : How Lightning Platform Query Optimizer works for LDV
Salesforce Apex Hours : How Lightning Platform Query Optimizer works for LDV
 
What’s New in Imply 3.3 & Apache Druid 0.18
What’s New in Imply 3.3 & Apache Druid 0.18What’s New in Imply 3.3 & Apache Druid 0.18
What’s New in Imply 3.3 & Apache Druid 0.18
 
Grokking TechTalk #20: PostgreSQL Internals 101
Grokking TechTalk #20: PostgreSQL Internals 101Grokking TechTalk #20: PostgreSQL Internals 101
Grokking TechTalk #20: PostgreSQL Internals 101
 
AngularJS Basics
AngularJS BasicsAngularJS Basics
AngularJS Basics
 

More from NAVER D2

[211] 인공지능이 인공지능 챗봇을 만든다
[211] 인공지능이 인공지능 챗봇을 만든다[211] 인공지능이 인공지능 챗봇을 만든다
[211] 인공지능이 인공지능 챗봇을 만든다NAVER D2
 
[233] 대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing: Maglev Hashing Scheduler i...
[233] 대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing: Maglev Hashing Scheduler i...[233] 대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing: Maglev Hashing Scheduler i...
[233] 대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing: Maglev Hashing Scheduler i...NAVER D2
 
[215] Druid로 쉽고 빠르게 데이터 분석하기
[215] Druid로 쉽고 빠르게 데이터 분석하기[215] Druid로 쉽고 빠르게 데이터 분석하기
[215] Druid로 쉽고 빠르게 데이터 분석하기NAVER D2
 
[245]Papago Internals: 모델분석과 응용기술 개발
[245]Papago Internals: 모델분석과 응용기술 개발[245]Papago Internals: 모델분석과 응용기술 개발
[245]Papago Internals: 모델분석과 응용기술 개발NAVER D2
 
[236] 스트림 저장소 최적화 이야기: 아파치 드루이드로부터 얻은 교훈
[236] 스트림 저장소 최적화 이야기: 아파치 드루이드로부터 얻은 교훈[236] 스트림 저장소 최적화 이야기: 아파치 드루이드로부터 얻은 교훈
[236] 스트림 저장소 최적화 이야기: 아파치 드루이드로부터 얻은 교훈NAVER D2
 
[235]Wikipedia-scale Q&A
[235]Wikipedia-scale Q&A[235]Wikipedia-scale Q&A
[235]Wikipedia-scale Q&ANAVER D2
 
[244]로봇이 현실 세계에 대해 학습하도록 만들기
[244]로봇이 현실 세계에 대해 학습하도록 만들기[244]로봇이 현실 세계에 대해 학습하도록 만들기
[244]로봇이 현실 세계에 대해 학습하도록 만들기NAVER D2
 
[243] Deep Learning to help student’s Deep Learning
[243] Deep Learning to help student’s Deep Learning[243] Deep Learning to help student’s Deep Learning
[243] Deep Learning to help student’s Deep LearningNAVER D2
 
[234]Fast & Accurate Data Annotation Pipeline for AI applications
[234]Fast & Accurate Data Annotation Pipeline for AI applications[234]Fast & Accurate Data Annotation Pipeline for AI applications
[234]Fast & Accurate Data Annotation Pipeline for AI applicationsNAVER D2
 
Old version: [233]대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing
Old version: [233]대형 컨테이너 클러스터에서의 고가용성 Network Load BalancingOld version: [233]대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing
Old version: [233]대형 컨테이너 클러스터에서의 고가용성 Network Load BalancingNAVER D2
 
[226]NAVER 광고 deep click prediction: 모델링부터 서빙까지
[226]NAVER 광고 deep click prediction: 모델링부터 서빙까지[226]NAVER 광고 deep click prediction: 모델링부터 서빙까지
[226]NAVER 광고 deep click prediction: 모델링부터 서빙까지NAVER D2
 
[225]NSML: 머신러닝 플랫폼 서비스하기 & 모델 튜닝 자동화하기
[225]NSML: 머신러닝 플랫폼 서비스하기 & 모델 튜닝 자동화하기[225]NSML: 머신러닝 플랫폼 서비스하기 & 모델 튜닝 자동화하기
[225]NSML: 머신러닝 플랫폼 서비스하기 & 모델 튜닝 자동화하기NAVER D2
 
[224]네이버 검색과 개인화
[224]네이버 검색과 개인화[224]네이버 검색과 개인화
[224]네이버 검색과 개인화NAVER D2
 
[216]Search Reliability Engineering (부제: 지진에도 흔들리지 않는 네이버 검색시스템)
[216]Search Reliability Engineering (부제: 지진에도 흔들리지 않는 네이버 검색시스템)[216]Search Reliability Engineering (부제: 지진에도 흔들리지 않는 네이버 검색시스템)
[216]Search Reliability Engineering (부제: 지진에도 흔들리지 않는 네이버 검색시스템)NAVER D2
 
[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기
[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기
[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기NAVER D2
 
[213] Fashion Visual Search
[213] Fashion Visual Search[213] Fashion Visual Search
[213] Fashion Visual SearchNAVER D2
 
[232] TensorRT를 활용한 딥러닝 Inference 최적화
[232] TensorRT를 활용한 딥러닝 Inference 최적화[232] TensorRT를 활용한 딥러닝 Inference 최적화
[232] TensorRT를 활용한 딥러닝 Inference 최적화NAVER D2
 
[242]컴퓨터 비전을 이용한 실내 지도 자동 업데이트 방법: 딥러닝을 통한 POI 변화 탐지
[242]컴퓨터 비전을 이용한 실내 지도 자동 업데이트 방법: 딥러닝을 통한 POI 변화 탐지[242]컴퓨터 비전을 이용한 실내 지도 자동 업데이트 방법: 딥러닝을 통한 POI 변화 탐지
[242]컴퓨터 비전을 이용한 실내 지도 자동 업데이트 방법: 딥러닝을 통한 POI 변화 탐지NAVER D2
 
[212]C3, 데이터 처리에서 서빙까지 가능한 하둡 클러스터
[212]C3, 데이터 처리에서 서빙까지 가능한 하둡 클러스터[212]C3, 데이터 처리에서 서빙까지 가능한 하둡 클러스터
[212]C3, 데이터 처리에서 서빙까지 가능한 하둡 클러스터NAVER D2
 
[223]기계독해 QA: 검색인가, NLP인가?
[223]기계독해 QA: 검색인가, NLP인가?[223]기계독해 QA: 검색인가, NLP인가?
[223]기계독해 QA: 검색인가, NLP인가?NAVER D2
 

More from NAVER D2 (20)

[211] 인공지능이 인공지능 챗봇을 만든다
[211] 인공지능이 인공지능 챗봇을 만든다[211] 인공지능이 인공지능 챗봇을 만든다
[211] 인공지능이 인공지능 챗봇을 만든다
 
[233] 대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing: Maglev Hashing Scheduler i...
[233] 대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing: Maglev Hashing Scheduler i...[233] 대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing: Maglev Hashing Scheduler i...
[233] 대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing: Maglev Hashing Scheduler i...
 
[215] Druid로 쉽고 빠르게 데이터 분석하기
[215] Druid로 쉽고 빠르게 데이터 분석하기[215] Druid로 쉽고 빠르게 데이터 분석하기
[215] Druid로 쉽고 빠르게 데이터 분석하기
 
[245]Papago Internals: 모델분석과 응용기술 개발
[245]Papago Internals: 모델분석과 응용기술 개발[245]Papago Internals: 모델분석과 응용기술 개발
[245]Papago Internals: 모델분석과 응용기술 개발
 
[236] 스트림 저장소 최적화 이야기: 아파치 드루이드로부터 얻은 교훈
[236] 스트림 저장소 최적화 이야기: 아파치 드루이드로부터 얻은 교훈[236] 스트림 저장소 최적화 이야기: 아파치 드루이드로부터 얻은 교훈
[236] 스트림 저장소 최적화 이야기: 아파치 드루이드로부터 얻은 교훈
 
[235]Wikipedia-scale Q&A
[235]Wikipedia-scale Q&A[235]Wikipedia-scale Q&A
[235]Wikipedia-scale Q&A
 
[244]로봇이 현실 세계에 대해 학습하도록 만들기
[244]로봇이 현실 세계에 대해 학습하도록 만들기[244]로봇이 현실 세계에 대해 학습하도록 만들기
[244]로봇이 현실 세계에 대해 학습하도록 만들기
 
[243] Deep Learning to help student’s Deep Learning
[243] Deep Learning to help student’s Deep Learning[243] Deep Learning to help student’s Deep Learning
[243] Deep Learning to help student’s Deep Learning
 
[234]Fast & Accurate Data Annotation Pipeline for AI applications
[234]Fast & Accurate Data Annotation Pipeline for AI applications[234]Fast & Accurate Data Annotation Pipeline for AI applications
[234]Fast & Accurate Data Annotation Pipeline for AI applications
 
Old version: [233]대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing
Old version: [233]대형 컨테이너 클러스터에서의 고가용성 Network Load BalancingOld version: [233]대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing
Old version: [233]대형 컨테이너 클러스터에서의 고가용성 Network Load Balancing
 
[226]NAVER 광고 deep click prediction: 모델링부터 서빙까지
[226]NAVER 광고 deep click prediction: 모델링부터 서빙까지[226]NAVER 광고 deep click prediction: 모델링부터 서빙까지
[226]NAVER 광고 deep click prediction: 모델링부터 서빙까지
 
[225]NSML: 머신러닝 플랫폼 서비스하기 & 모델 튜닝 자동화하기
[225]NSML: 머신러닝 플랫폼 서비스하기 & 모델 튜닝 자동화하기[225]NSML: 머신러닝 플랫폼 서비스하기 & 모델 튜닝 자동화하기
[225]NSML: 머신러닝 플랫폼 서비스하기 & 모델 튜닝 자동화하기
 
[224]네이버 검색과 개인화
[224]네이버 검색과 개인화[224]네이버 검색과 개인화
[224]네이버 검색과 개인화
 
[216]Search Reliability Engineering (부제: 지진에도 흔들리지 않는 네이버 검색시스템)
[216]Search Reliability Engineering (부제: 지진에도 흔들리지 않는 네이버 검색시스템)[216]Search Reliability Engineering (부제: 지진에도 흔들리지 않는 네이버 검색시스템)
[216]Search Reliability Engineering (부제: 지진에도 흔들리지 않는 네이버 검색시스템)
 
[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기
[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기
[214] Ai Serving Platform: 하루 수 억 건의 인퍼런스를 처리하기 위한 고군분투기
 
[213] Fashion Visual Search
[213] Fashion Visual Search[213] Fashion Visual Search
[213] Fashion Visual Search
 
[232] TensorRT를 활용한 딥러닝 Inference 최적화
[232] TensorRT를 활용한 딥러닝 Inference 최적화[232] TensorRT를 활용한 딥러닝 Inference 최적화
[232] TensorRT를 활용한 딥러닝 Inference 최적화
 
[242]컴퓨터 비전을 이용한 실내 지도 자동 업데이트 방법: 딥러닝을 통한 POI 변화 탐지
[242]컴퓨터 비전을 이용한 실내 지도 자동 업데이트 방법: 딥러닝을 통한 POI 변화 탐지[242]컴퓨터 비전을 이용한 실내 지도 자동 업데이트 방법: 딥러닝을 통한 POI 변화 탐지
[242]컴퓨터 비전을 이용한 실내 지도 자동 업데이트 방법: 딥러닝을 통한 POI 변화 탐지
 
[212]C3, 데이터 처리에서 서빙까지 가능한 하둡 클러스터
[212]C3, 데이터 처리에서 서빙까지 가능한 하둡 클러스터[212]C3, 데이터 처리에서 서빙까지 가능한 하둡 클러스터
[212]C3, 데이터 처리에서 서빙까지 가능한 하둡 클러스터
 
[223]기계독해 QA: 검색인가, NLP인가?
[223]기계독해 QA: 검색인가, NLP인가?[223]기계독해 QA: 검색인가, NLP인가?
[223]기계독해 QA: 검색인가, NLP인가?
 

Recently uploaded

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 

Recently uploaded (20)

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 

212 building googlebot - deview - google drive

  • 3. From the web to your query ● Query processing 1. Lookup keywords in the index => every relevant page 2. Rank pages and display the result ● Google's index of the web keyword => { page1, page2, ... } ● Building the index requires processing the current version of all of the pages on the web...
  • 4. All of the pages on the web!?!
  • 5. 60 Trillion Pages And Counting!
  • 6. Our local copy of the web ● Crawling ○ Googlebot ● Storage ○ Google File System (GFS), BigTable ● Processing ○ MapReduce ● Data Centers ○ Job control, Fault-Tolerance, High-Speed Networking, Power/Cooling, etc.
  • 7. Finding every page with googlebot ● Basic discovery crawl 1. Start with the set of known links 2. Crawl every link (pages change!) 3. Extract every new link, repeat Extract Links Crawl Status Web Page Crawl Pages
  • 8. Key considerations in crawling ● Polite crawling ○ Do not overload websites and DNS (no DoS!) ○ Understand web serving infrastructure ● Prioritize among discovered links ○ Crawl is a giant queuing system ○ Predicting serving capacity ● Do not waste resources ○ Ignore spam/broken links ○ Skip links with duplicate content
  • 9. Mirrors ● Hosts with exactly the same content deview.kr www.deview.kr ● Paths within hosts with the same content www.cs.unc.edu/Courses/comp426-f09/docs/tools/downloads/tomcat/ jakarta-tomcat-4.1.29/webapps/tomcat-docs www.cs.unc.edu/Courses/comp590-001-f08/docs/tools/downloads/tomcat/ jakarta-tomcat-4.1.29/webapps/tomcat-docs www.cs.unc.edu/Courses/comp590-001-f08/tools/downloads/tomcat/ jakarta-tomcat-4.1.29/webapps/tomcat-docs www.cs.unc.edu/Courses/jbs/tools/downloads/tomcat/ jakarta-tomcat/4.1.29/webapps/tomcat-docs ● Unrestricted mirroring across hosts and paths ○ Distributed graph mining
  • 10.
  • 11. Optimizing our crawling ● Efficient crawling requires duplicate handling ○ Predict whether a newly discovered link points to duplicate content ○ Must happen before crawling useful(link, status_table) => { yes, no }
  • 12. Duplicates in Dynamic Pages ● Duplicates are most common in dynamic links http://foo.com/forum/viewtopic.php?t=3808&sid=126bc5f2 http://foo.com/forum/viewtopic.php?t=3808&sid=d5b8483b http://foo.com/forum/viewtopic.php?t=3808&sid=3b1a8e27 http://foo.com/forum/viewtopic.php?t=3808&sid=2a21f059 ... ● Significance analysis ○ Parameter t is relevant ○ Parameter sid is irrelevant ● Duplicate prediction http://foo.com/forum/viewtopic.php?t=3808&sid=ee5da24a => Same content
  • 13. Equivalence rules and class names ● Equivalence rule for a cluster ○ Set of relevant parameters ○ Set of irrelevant parameters ● Equivalence class name ○ Remove irrelevant parameters ECN(link1) = ECN(link2) => Same content! ○ For the previous example ECN(http://foo.com/forum/viewtopic.php?t=3808&sid=ee5da24a) = http://foo.com/forum/viewtopic.php?t=3808
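Computing an equivalence class name amounts to dropping the irrelevant query parameters from a link. The sketch below shows one way to do that with Python's standard `urllib.parse`; the function name `equivalence_class_name` and the hard-coded rule are assumptions for illustration, not the speaker's implementation.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def equivalence_class_name(link, irrelevant_params):
    """Return link with every irrelevant query parameter removed."""
    parts = urlsplit(link)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in irrelevant_params]
    # urlencode re-escapes values, so unusual percent-encoding may change.
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Hypothetical rule for the forum cluster: sid is irrelevant, t is relevant.
irrelevant = {"sid"}
a = equivalence_class_name(
    "http://foo.com/forum/viewtopic.php?t=3808&sid=ee5da24a", irrelevant)
b = equivalence_class_name(
    "http://foo.com/forum/viewtopic.php?t=3808&sid=126bc5f2", irrelevant)
print(a)
print(a == b)
```

Both links map to the same name, so under this rule they are predicted to carry the same content.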
  • 14. Modified crawl algorithm ● Representative table ○ Equivalence class name => representative link ● Given a new link 1. Identify cluster 2. Lookup equivalence rule 3. Apply rule to determine equivalence class name 4. Lookup table of representatives 5. Crawl link if no representative found
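The five steps of the modified crawl decision can be sketched as follows. Several details are assumptions the slides do not specify: a cluster is identified here by host plus path, the rule table `RULES` is hand-written, and the link is recorded as the representative at the moment it is approved for crawling.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical equivalence rules: cluster key -> set of irrelevant parameters.
RULES = {"foo.com/forum/viewtopic.php": {"sid"}}

def cluster_of(link):
    """Assumption: a cluster is identified by host + path."""
    parts = urlsplit(link)
    return parts.netloc + parts.path

def ecn(link, irrelevant):
    """Equivalence class name: cluster plus the sorted remaining parameters."""
    query = parse_qsl(urlsplit(link).query)
    kept = sorted((k, v) for k, v in query if k not in irrelevant)
    return (cluster_of(link), tuple(kept))

def should_crawl(link, representatives):
    cluster = cluster_of(link)               # 1. identify cluster
    irrelevant = RULES.get(cluster, set())   # 2. look up equivalence rule
    name = ecn(link, irrelevant)             # 3. apply rule to get the ECN
    if name in representatives:              # 4. look up representative table
        return False                         #    predicted duplicate: skip
    representatives[name] = link             # 5. crawl; remember representative
    return True

reps = {}
first = should_crawl("http://foo.com/forum/viewtopic.php?t=3808&sid=126bc5f2", reps)
dup = should_crawl("http://foo.com/forum/viewtopic.php?t=3808&sid=d5b8483b", reps)
other = should_crawl("http://foo.com/forum/viewtopic.php?t=3809&sid=d5b8483b", reps)
print(first, dup, other)
```

A cluster with no rule falls back to treating every parameter as relevant, so unknown links are always crawled rather than skipped.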
  • 15. Equivalence rule generation ● Find every crawled link under a cluster cluster = { link1 : content1, link2 : content2, ... } ● Study evidence 1. Insignificance analysis 2. Significance analysis 3. Parameter classification 4. Equivalence rule construction rule(cluster) = { param1 : RELEVANT, param2 : IRRELEVANT, param3 : CONFLICT, ... }
  • 16. 1. Insignificance analysis ● Group links by content content1 = { link11, link12, ... } content2 = { link21, link22, ... } ... ● For each parameter ○ For each content group with this parameter ■ If the parameter values are not all the same, add the number of links in the group to the insignificance index
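A minimal sketch of the insignificance index, using the two-parameter cluster from the P/Q example slides as input. The function name, the use of the labels "A" and "B" in place of content checksums, and the choice to count only links that actually carry the parameter are assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl

def insignificance_index(cluster, param):
    """cluster maps link -> content id (a stand-in for a content checksum)."""
    groups = defaultdict(list)
    for link, content in cluster.items():
        groups[content].append(dict(parse_qsl(urlsplit(link).query)))
    index = 0
    for params_list in groups.values():
        # Values of param among links in this content group that carry it.
        values = [p[param] for p in params_list if param in p]
        if len(set(values)) > 1:   # same content despite differing values
            index += len(values)
    return index

cluster = {
    "http://foo.com/directory?P=1&Q=3": "A",
    "http://foo.com/directory?P=2&Q=3": "A",
    "http://foo.com/directory?P=1&Q=2": "B",
    "http://foo.com/directory?P=2&Q=2": "B",
    "http://foo.com/directory?P=3&Q=2": "B",
    "http://foo.com/directory?P=4&Q=2": "B",
}
print(insignificance_index(cluster, "P"), insignificance_index(cluster, "Q"))
```

For this cluster the result matches the example slides: P varies inside both content groups (2 + 4 = 6), while Q is constant within each group (0).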
  • 17. 2. Significance analysis ● For each parameter ○ Remove the parameter from every link ■ Group content by remainder link remainder1 = { content11, content12, ... } remainder2 = { content21, content22, ... } ... ■ For each remainder group, increment the significance index by the number of unique contents minus 1
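The significance index can be sketched the same way: strip the parameter, group contents by what is left of the link, and count the extra distinct contents in each group. The function name, the "A"/"B" content labels, and skipping links that do not carry the parameter are assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl

def significance_index(cluster, param):
    """cluster maps link -> content id (a stand-in for a content checksum)."""
    remainders = defaultdict(set)
    for link, content in cluster.items():
        parts = urlsplit(link)
        query = parse_qsl(parts.query)
        if param not in dict(query):
            continue  # assumption: links without the parameter give no evidence
        rest = tuple(sorted((k, v) for k, v in query if k != param))
        remainders[(parts.netloc, parts.path, rest)].add(content)
    # Each remainder contributes (number of unique contents - 1).
    return sum(len(contents) - 1 for contents in remainders.values())

cluster = {
    "http://foo.com/directory?P=1&Q=3": "A",
    "http://foo.com/directory?P=2&Q=3": "A",
    "http://foo.com/directory?P=1&Q=2": "B",
    "http://foo.com/directory?P=2&Q=2": "B",
    "http://foo.com/directory?P=3&Q=2": "B",
    "http://foo.com/directory?P=4&Q=2": "B",
}
print(significance_index(cluster, "P"), significance_index(cluster, "Q"))
```

Removing P leaves the remainders Q=3 and Q=2, each with a single content, so P scores 0. Removing Q leaves P=1 and P=2 with two contents each (1 + 1 = 2), while P=3 and P=4 have one content each and add nothing.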
  • 18. 3. Parameter classification ● For each parameter ○ Compute content relevance (or irrelevance) value Content_Relevance = Significance_Index / (Significance_Index + Insignificance_Index) Content_Irrelevance = Insignificance_Index / (Significance_Index + Insignificance_Index) ○ Sample criteria: 90/10 rule ■ If relevance > 90% => parameter is RELEVANT ■ If relevance < 10% => parameter is IRRELEVANT ■ Otherwise, parameter is CONFLICT
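The 90/10 rule reduces to a ratio and two thresholds. In this sketch the function name `classify` and the handling of a parameter with no evidence at all (both indexes zero, treated as CONFLICT) are assumptions; the slides do not cover that case.

```python
def classify(significance, insignificance, high=0.9, low=0.1):
    """Classify a parameter from its significance and insignificance indexes."""
    total = significance + insignificance
    if total == 0:
        return "CONFLICT"  # assumption: no evidence either way
    relevance = significance / total
    if relevance > high:
        return "RELEVANT"
    if relevance < low:
        return "IRRELEVANT"
    return "CONFLICT"

# Index values taken from the P and Q example slides.
rule = {"P": classify(0, 6), "Q": classify(2, 0)}
print(rule)
```

The thresholds are parameters because the slide presents 90/10 only as sample criteria.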
  • 19. Example: P is content-irrelevant ● Cluster ○ Content A: http://foo.com/directory?P=1&Q=3, http://foo.com/directory?P=2&Q=3 ○ Content B: http://foo.com/directory?P=1&Q=2, http://foo.com/directory?P=2&Q=2, http://foo.com/directory?P=3&Q=2, http://foo.com/directory?P=4&Q=2 ● Insignificance analysis of P ○ Content A: 2 links, different Ps ○ Content B: 4 links, different Ps ○ P's insignificance index = 2 + 4 = 6 ○ P's content-irrelevance value = 100% ● Significance analysis of P ○ Q=3: 2 links, Content A ○ Q=2: 4 links, Content B ○ P's significance index = 0 ○ P's content-relevance value = 0%
  • 20. Example: Q is content-relevant ● Cluster ○ Content A: http://foo.com/directory?P=1&Q=3, http://foo.com/directory?P=2&Q=3 ○ Content B: http://foo.com/directory?P=1&Q=2, http://foo.com/directory?P=2&Q=2, http://foo.com/directory?P=3&Q=2, http://foo.com/directory?P=4&Q=2 ● Insignificance analysis of Q ○ Content A: 2 links, same Q ○ Content B: 4 links, same Q ○ Q's insignificance index = 0 ○ Q's content-irrelevance value = 0% ● Significance analysis of Q ○ P=1: 2 links, Content A & B ○ P=2: 2 links, Content A & B ○ Q's significance index = 1 + 1 = 2 ○ Q's content-relevance value = 100%
  • 21. Facing the Real World ● Limitations ○ Co-changing parameters ○ Noisy data ○ Parameters not used in the standard way ○ Need for continuous validation ● State-of-the-art ○ White-box vs black-box ● Search is not solved ○ Not even crawling is solved!
  • 22. Defining duplicates ● Identical pages ● Identical visible content ● Essentially identical visible content ○ Ignore page generation time ○ Ignore breaking news side bar ○ etc. ● What is the right answer? Two pages should be considered duplicates if our users would consider them duplicates ● How to translate this notion into a checksum?
  • 23. Q&A
  • 24.