1. The document summarizes the key components of a search engine's structure including the crawling, indexing, and ranking processes.
2. It describes how crawlers retrieve web pages from the internet and store them in a repository. Indexes are then created from the repository including a document index, lexicon, inverted index, URL list, and link index to organize and relate the data.
3. The ranking process is also summarized, including PageRank, which determines a page's importance from the number and quality of the links pointing to it. The overall structure brings these components together: a user's query goes to the search server, which uses the indexes built by the back-end to retrieve relevant results.
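9. Search Engine Structure
Back-end's Role
[Diagram: Search Server, Index, Back-end]
• Crawling
  • The technique of collecting web pages
  • Takes a long time -> multiple crawlers are used
  • Collected pages are kept in the Repository
• Creating Index
  • Builds the Index from the web pages stored in the Repository
  • Structure analysis, word processing, link processing, ranking, etc.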
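12. Search Engine Structure
Back-end Structure
[Diagram: URL server -> crawlers -> Internet; crawlers -> Repository]
• The URL server directs all of the crawlers
• Each crawler downloads web pages from the Internet as instructed
• Downloaded pages are stored temporarily in the Repository
Repository
• docID – unique numeric value
• url – URL
• text – compressed page content
• etc. – date, page length…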
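The crawl loop described on this slide can be sketched roughly as follows. This is a minimal illustration, not the design the slides describe in detail: the names `URLServer`, `RepositoryRecord`, `crawl`, and `fake_web` are hypothetical, the download is replaced by a dictionary lookup so the example runs offline, and only one crawler is shown (several crawlers would share the same URL server).

```python
import zlib
from collections import deque
from dataclasses import dataclass


@dataclass
class RepositoryRecord:
    doc_id: int        # unique numeric value
    url: str
    text: bytes        # page body, stored compressed
    page_length: int   # one of the "etc." fields


class URLServer:
    """Hands out URLs to crawlers, one at a time."""

    def __init__(self, seed_urls):
        self._queue = deque(seed_urls)

    def next_url(self):
        return self._queue.popleft() if self._queue else None


def crawl(url_server, fetch, repository):
    """One crawler loop: ask for a URL, download it, store it, repeat."""
    while (url := url_server.next_url()) is not None:
        html = fetch(url)
        doc_id = len(repository) + 1
        compressed = zlib.compress(html.encode("utf-8"))
        repository.append(RepositoryRecord(doc_id, url, compressed, len(html)))


# Hypothetical stand-in for a real HTTP download (assumption, not from the slides).
fake_web = {
    "http://sejong.ac.kr": "<html><title>Sejong</title></html>",
    "http://cyworld.com": "<html><title>Cyworld</title></html>",
}
repository = []
crawl(URLServer(fake_web), fake_web.__getitem__, repository)
```

Assigning the docID from the repository length is only a convenience here; any scheme that yields unique numbers would do.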
13. Search Engine Structure
Back-end Structure
[Diagram: URL server -> crawlers -> Internet; crawlers -> Repository]
• Address resolution takes a long time -> a DNS cache is maintained internally
• After a page is stored in the Repository, the URL server assigns the next address
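A DNS cache of the kind mentioned here might look like the sketch below. It is an assumption-laden illustration: `CachingResolver` is a hypothetical name, entries never expire (a real cache would respect record lifetimes), and the example injects a fake resolver with a made-up address so it runs without network access. By default it would fall back to the standard library's `socket.gethostbyname`.

```python
import socket


class CachingResolver:
    """Resolves host names, remembering earlier answers."""

    def __init__(self, resolve=socket.gethostbyname):
        self._resolve = resolve   # the slow lookup being avoided
        self._cache = {}          # host name -> IP address
        self.misses = 0           # how many real lookups were needed

    def lookup(self, host):
        if host not in self._cache:
            self.misses += 1
            self._cache[host] = self._resolve(host)
        return self._cache[host]


# A fake resolver keeps the example offline; the address is made up.
resolver = CachingResolver(resolve=lambda host: "192.0.2.1")
first = resolver.lookup("sejong.ac.kr")
second = resolver.lookup("sejong.ac.kr")   # served from the cache
```

Because many pages on a site share one host name, even this simple dictionary removes most repeated lookups during a crawl.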
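14. Search Engine Structure
Back-end Structure
Creating Index
Analyzing Web Page structures
[Diagram: the repository entry with docID 1 and url Sejong.ac.kr is parsed; its <title>세종대학교</title> (Sejong University) becomes the Title field, the page body contains the heading <h1>학사정보</h1> (academic information), and other fields follow …]
DocIndex
– Stores basic information about each web page
– Uses docID as the key
– Fields: docID, url, title, etc.
URLlist
– Uses url as the key
– Used to look up the docID
– Fields: url, docID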
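As a rough sketch of this step, the two tables can be built as plain dictionaries while parsing each stored page. The helper names `TitleParser` and `build_indexes` are hypothetical, only the title is extracted (the "etc." fields are omitted), and the sample page is a cleaned-up version of the one in the diagram.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def build_indexes(pages):
    """pages: iterable of (doc_id, url, html) taken from the repository."""
    doc_index = {}   # docID -> basic page information
    url_list = {}    # url -> docID
    for doc_id, url, html in pages:
        parser = TitleParser()
        parser.feed(html)
        doc_index[doc_id] = {"url": url, "title": parser.title.strip()}
        url_list[url] = doc_id
    return doc_index, url_list


page = "<html><head><title>세종대학교</title></head><body><h1>학사정보</h1></body></html>"
doc_index, url_list = build_indexes([(1, "Sejong.ac.kr", page)])
```

The two dictionaries mirror each other on purpose: DocIndex answers "what is page 1?", while URLlist answers "which docID does this URL have?".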
15. Search Engine Structure
Back-end Structure
Creating Index
Word Index
Lexicon
– word -> wordID
– Example: 세종 (Sejong) -> 101, 대학교 (university) -> 102, 학사 (academic) -> 201, 정보 (information) -> 202
Barrels
– docID wordID position size etc.
– Layout per docID: wordID#1 with (Position#1 Size#1 Etc.#1), (Position#2 Size#2 Etc.#2); wordID#2 with (Position#1 Size#1 Etc.#1), (Position#2 Size#2 Etc.#2); …
Inverted Index
– Uses wordID as the key
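The relationship between the Lexicon, the Barrels, and the Inverted Index can be illustrated with the sketch below. It makes several simplifying assumptions: `index_words` is a hypothetical helper, wordIDs are assigned sequentially (not 101, 102, … as in the slide's example), each hit records only a position (size and other attributes are left out), and the second document is invented for the example.

```python
from collections import defaultdict


def index_words(docs):
    """docs: dict of docID -> list of words in page order."""
    lexicon = {}                   # word -> wordID
    barrels = []                   # (docID, wordID, position) hits, in document order
    inverted = defaultdict(list)   # wordID -> [(docID, position), ...]
    for doc_id, words in docs.items():
        for position, word in enumerate(words):
            # Reuse the existing wordID, or assign the next free one.
            word_id = lexicon.setdefault(word, len(lexicon) + 1)
            barrels.append((doc_id, word_id, position))
            inverted[word_id].append((doc_id, position))
    return lexicon, barrels, dict(inverted)


docs = {
    1: ["세종", "대학교", "학사", "정보"],
    2: ["세종", "정보"],   # made-up second page
}
lexicon, barrels, inverted = index_words(docs)
```

The barrels are ordered by docID, which suits writing during indexing; the inverted index regroups the same hits by wordID, which suits lookups at query time.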
16. Search Engine Structure
Back-end Structure
Creating Index
Link Index
[Diagram: the page with docID 1 (url Sejong.ac.kr) links to the page with docID 3 (url Cyworld.com)]
URLlist
– Sejong.ac.kr -> 1
– Cyworld.com -> 3
Links
– 1 -> 3
Anchortext
– Information about the linked page
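One way to picture the link index is the sketch below, which turns each hyperlink into a (source docID, target docID, anchor text) record by looking targets up in the URLlist. The names `LinkParser` and `build_links` are hypothetical, the sample HTML is invented, and hrefs are matched against the URLlist exactly, without the URL normalization a real crawler would need.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self._href = None
        self._text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def build_links(pages, url_list):
    """pages: docID -> html; url_list: url -> docID. Returns link records."""
    links = []
    for src_id, html in pages.items():
        parser = LinkParser()
        parser.feed(html)
        for href, anchor in parser.links:
            if href in url_list:   # keep only links to pages already known
                links.append((src_id, url_list[href], anchor))
    return links


url_list = {"Sejong.ac.kr": 1, "Cyworld.com": 3}
pages = {
    1: '<p>Visit <a href="Cyworld.com">Cyworld home</a></p>',  # made-up page body
    3: "<p>No links here</p>",
}
links = build_links(pages, url_list)
```

Storing the anchor text with the link lets the index describe page 3 using words that appear on page 1.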
18. Search Engine Structure
Index Structure
DocIndex
– Stores basic information about each web page
– Uses docID as the key
Lexicon
– word -> wordID
Barrels
– Storage
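To show how these three structures could cooperate at query time, here is a small sketch. It assumes the barrel data has already been regrouped by wordID (called `inverted` here), treats a query as a set of words that must all appear, and uses made-up English sample data; `search` is a hypothetical function, not something named in the slides.

```python
def search(query, lexicon, inverted, doc_index):
    """Return the DocIndex entries of pages containing every query word."""
    result = None
    for word in query.split():
        word_id = lexicon.get(word)                # Lexicon: word -> wordID
        doc_ids = set(inverted.get(word_id, ()))   # wordID -> docIDs
        result = doc_ids if result is None else result & doc_ids
    return [doc_index[d] for d in sorted(result or ())]


# Made-up sample data for illustration only.
lexicon = {"sejong": 101, "university": 102, "academic": 201}
inverted = {101: [1], 102: [1, 3], 201: [3]}
doc_index = {1: {"url": "Sejong.ac.kr"}, 3: {"url": "Cyworld.com"}}

hits = search("sejong university", lexicon, inverted, doc_index)
```

An unknown word simply yields no documents, because it has no wordID to look up.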
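19. Total Structure
[Diagram: the User sends queries to the Search Server, which reads the Index. The Index covers structure (DocIndex, URLlist), words (Lexicon, Barrels), links (Links), and Ranking. The Back-end consists of the URL server, the crawlers, and the Repository, and the crawlers fetch pages from the Internet.]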
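The slide does not say how the Ranking component works. One well-known link-based approach is PageRank, which scores a page by the number and quality of the pages linking to it. The sketch below is a simplified power-iteration version under stated assumptions: the function name, the damping factor of 0.85, the fixed iteration count, and the three-page graph are all illustrative choices, not taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: docID -> list of docIDs it links to. Returns docID -> score."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page keeps a small base score regardless of incoming links.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            # A page with no outgoing links spreads its score over all pages.
            targets = links.get(node) or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Made-up graph: pages 1 and 2 link to page 3, and page 3 links back to page 1.
ranks = pagerank({1: [3], 2: [3], 3: [1]})
```

In this graph page 3 ends up with the highest score because two pages link to it, and page 2, which nothing links to, ends up with the lowest.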