Building a scalable distributed WWW search engine … NOT in Perl!
Presented by Alex Chudnovsky (http://www.majestic12.co.uk) at Birmingham Perl Mongers User Group (http://birmingham.pm.org)
V1.0 27/07/05
Contents
History (of my work in area of information retrieval)
Goals
Architecture
Data collection (crawling)
  Base: issues URLs to crawl and receives compressed pages.
  Distributed crawlers: receive lists of URLs to crawl, crawl them and send back compressed data. In the future they will also do distributed indexing.
  Note: this stage is optional if you already have data to index, for example a list of products with their descriptions.
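The exchange between the base and the crawlers can be sketched roughly as follows. This is a minimal illustration only, not the project's actual protocol: the class and function names (Base, issue_batch, receive, crawl, fake_fetch) are hypothetical, zlib is an assumed choice of compression, and the network fetch is replaced by a stub so the sketch runs offline.

```python
import zlib
from collections import deque
from typing import Callable, Dict, Iterable, List


class Base:
    """Central node (hypothetical): issues URL batches, receives compressed pages."""

    def __init__(self, urls: Iterable[str]) -> None:
        self.queue = deque(urls)           # URLs still waiting to be crawled
        self.pages: Dict[str, bytes] = {}  # URL -> decompressed page body

    def issue_batch(self, size: int) -> List[str]:
        """Hand out up to `size` URLs to one crawler."""
        batch: List[str] = []
        while self.queue and len(batch) < size:
            batch.append(self.queue.popleft())
        return batch

    def receive(self, results: Dict[str, bytes]) -> None:
        """Accept compressed pages sent back by a crawler."""
        for url, blob in results.items():
            self.pages[url] = zlib.decompress(blob)


def crawl(batch: List[str], fetch: Callable[[str], bytes]) -> Dict[str, bytes]:
    """Crawler side: fetch every URL in the batch and compress the result."""
    return {url: zlib.compress(fetch(url)) for url in batch}


def fake_fetch(url: str) -> bytes:
    """Stand-in for a real HTTP request (assumption, keeps the sketch offline)."""
    return f"<html>page at {url}</html>".encode("utf-8")


base = Base([
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
])

# One crawler repeatedly asks for work until the base has nothing left.
while True:
    batch = base.issue_batch(2)
    if not batch:
        break
    base.receive(crawl(batch, fake_fetch))

print(len(base.pages), "pages collected")
```

Sending compressed data back keeps the crawler-to-base traffic small, which matters when the crawlers run on volunteers' broadband connections.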
Crawler screenshot 1
Crawler screenshot 2
Crawler screenshot 3
Crawler screenshot 4
Crawler screenshot 5
Current Stats
  Source: http://www.majestic12.co.uk/projects/dsearch/stats.php as of 27/07/05
Indexing
  Indexing is the process of turning words into numbers and creating an inverted index.
  Data barrel:
    Doc #0: Birmingham Perl Mongers
    Doc #1: Birmingham City
    Doc #2: Perl City
  Lexicon (maps words to their numeric WordIDs):
    Birmingham – 0
    Perl – 1
    Mongers – 2
    City – 3
  Inverted index (each WordID has a list of DocIDs, ideally sorted):
    0 -> 0, 1
    1 -> 0, 2
    2 -> 0
    3 -> 1, 2
  Note: if you use a database, it makes sense to have a clustered index on WordID.
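A minimal sketch of this step, using the three example documents from the slide. The function name build_index is hypothetical, and splitting on whitespace with case-sensitive matching is a simplifying assumption; a real indexer would normalise the text first.

```python
from typing import Dict, List, Tuple


def build_index(docs: List[str]) -> Tuple[Dict[str, int], Dict[int, List[int]]]:
    """Return (lexicon, inverted index) for a list of documents.

    The lexicon maps each word to a numeric WordID, assigned in order of
    first appearance. The inverted index maps each WordID to the list of
    DocIDs containing that word.
    """
    lexicon: Dict[str, int] = {}
    inverted: Dict[int, List[int]] = {}
    for doc_id, text in enumerate(docs):
        for word in text.split():
            # Reuse the existing WordID, or assign the next free one.
            word_id = lexicon.setdefault(word, len(lexicon))
            postings = inverted.setdefault(word_id, [])
            # Documents are visited in DocID order, so each list stays
            # sorted; this check only skips repeats within one document.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return lexicon, inverted


docs = ["Birmingham Perl Mongers", "Birmingham City", "Perl City"]
lexicon, inverted = build_index(docs)
print(lexicon)
print(inverted)
```

Because documents are processed in increasing DocID order, the DocID lists come out sorted without an explicit sort, which is what the intersection at search time relies on.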
Merging
  Individual indexed barrels -> single searchable index
  Note: this stage is not necessary if just one barrel is used, as there is then no need to remap IDs from local to their global equivalents.
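One way the remapping could work is sketched below. The slides only say that local IDs are remapped to global ones; the specific scheme here (a shared global lexicon for WordIDs, and a running offset for DocIDs) is an assumption, as are the name merge_barrels and the barrel layout.

```python
from typing import Dict, List, Tuple

# A barrel (hypothetical layout): local lexicon, local inverted index,
# and the number of documents it holds.
Barrel = Tuple[Dict[str, int], Dict[int, List[int]], int]


def merge_barrels(barrels: List[Barrel]) -> Tuple[Dict[str, int], Dict[int, List[int]]]:
    """Merge per-barrel indexes into one index with global WordIDs and DocIDs."""
    global_lexicon: Dict[str, int] = {}
    global_index: Dict[int, List[int]] = {}
    doc_offset = 0  # global DocID of the current barrel's local Doc #0
    for lexicon, inverted, doc_count in barrels:
        # Map this barrel's local WordIDs onto global WordIDs.
        local_to_global: Dict[int, int] = {}
        for word, local_id in lexicon.items():
            local_to_global[local_id] = global_lexicon.setdefault(
                word, len(global_lexicon)
            )
        # Shift local DocIDs by the offset and append to the global lists.
        for local_id, postings in inverted.items():
            merged = global_index.setdefault(local_to_global[local_id], [])
            merged.extend(doc_id + doc_offset for doc_id in postings)
        doc_offset += doc_count
    return global_lexicon, global_index


# Barrel A holds "Birmingham Perl Mongers" and "Birmingham City".
barrel_a: Barrel = (
    {"Birmingham": 0, "Perl": 1, "Mongers": 2, "City": 3},
    {0: [0, 1], 1: [0], 2: [0], 3: [1]},
    2,
)
# Barrel B holds "Perl City", with its own local IDs.
barrel_b: Barrel = ({"Perl": 0, "City": 1}, {0: [0], 1: [0]}, 1)

merged_lexicon, merged_index = merge_barrels([barrel_a, barrel_b])
print(merged_lexicon)
print(merged_index)
```

With barrels processed in order and offsets only growing, each merged DocID list stays sorted as long as the local lists were sorted.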
Searching
  Searching is the process of finding documents that contain the words from a search query.
  Documents:
    Doc #0: Birmingham Perl Mongers
    Doc #1: Birmingham City
    Doc #2: Perl City
  Lexicon (maps words to their numeric WordIDs):
    Birmingham – 0
    Perl – 1
    Mongers – 2
    City – 3
  Inverted index (lists DocIDs for each WordID):
    0 -> 0, 1
    1 -> 0, 2
    2 -> 0
    3 -> 1, 2
  Note: if you use a database, it makes sense to cluster on WordID.
  Search query: "Birmingham Perl"
  WordIDs: 0, 1
  Intersection of DocIDs present in both lists (implementation of boolean AND logic):
    0 (Brum)   1 (Perl)   Result
    0          0          Matched!
    1          n/a        Not matched!
    n/a        2          Not matched!
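The intersection step can be sketched as a two-pointer walk over the sorted DocID lists. The lexicon and inverted index are copied from the slide; the function names intersect and search are hypothetical, and treating a query word missing from the lexicon as "no results" is an assumption consistent with boolean AND.

```python
from typing import Dict, List, Optional

# Lexicon and inverted index from the example above.
LEXICON: Dict[str, int] = {"Birmingham": 0, "Perl": 1, "Mongers": 2, "City": 3}
INVERTED: Dict[int, List[int]] = {0: [0, 1], 1: [0, 2], 2: [0], 3: [1, 2]}


def intersect(a: List[int], b: List[int]) -> List[int]:
    """Return DocIDs present in both sorted lists (boolean AND)."""
    i = j = 0
    matched: List[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            matched.append(a[i])  # present in both lists: matched
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1                # a[i] cannot appear later in b
        else:
            j += 1                # b[j] cannot appear later in a
    return matched


def search(query: str, lexicon: Dict[str, int],
           inverted: Dict[int, List[int]]) -> List[int]:
    """Return DocIDs of documents containing every word of the query."""
    result: Optional[List[int]] = None
    for word in query.split():
        word_id = lexicon.get(word)
        if word_id is None:
            return []  # unknown word: nothing can match under AND
        postings = inverted.get(word_id, [])
        result = postings if result is None else intersect(result, postings)
    return result or []


print(search("Birmingham Perl", LEXICON, INVERTED))
```

Each list is scanned at most once, so the cost is linear in the combined length of the two lists. This is why the slide asks for the DocID lists to be kept sorted.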
Search engine screenshot 1
Search engine screenshot 2
Implementation
Why not Perl? (using C# instead)
Conclusions
Credits
  * Volunteers running the crawler who had crawled at least 1 million URLs as of 27/07/05
Recommended reading
Join!
  Join the project (unmetered broadband required!): majestic12.co.uk
  Your name could be here!
