Search Engine
                             How To Make it




Wednesday, December 12, 12
Search Engine
                      Search Quality Measurement

                      [Figure: Venn diagram inside the set of all documents, showing the
                      relevant documents (REL), the retrieved documents (RET), and their
                      overlap RET ∩ REL.]




             database search:              web search:
             - low recall                  - high recall
             - high precision              - low precision
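The two measures named above can be sketched as follows, assuming the usual definitions (precision is the share of retrieved documents that are relevant, recall is the share of relevant documents that were retrieved). The function name and the document ids are illustrative and are not part of the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    ret, rel = set(retrieved), set(relevant)
    hits = ret & rel  # RET ∩ REL in the diagram
    precision = len(hits) / len(ret) if ret else 0.0
    recall = len(hits) / len(rel) if rel else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d9"}
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Retrieving more documents can only keep recall the same or raise it, while precision usually drops, which is the trade-off the two columns above describe.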
Search Engine
                   [Figure: architecture overview. Sources (File System, 3rd party apps,
                   Database) are read by the File System Crawler, the Crawler API, and the
                   Database Crawler, producing raw content (PDF, Text, HTML, Document,
                   Image, ...). The Text Parser, HTML Parser, and PDF Parser turn it into
                   Documents (title, summary, author, datetime). Document Enhancing yields
                   Documents (Categorized, Taxonomized). The Indexer, with a Language
                   Analyzer and a Stop Analyzer, builds the Index. Index Searchers connect
                   the Index to the Web Client, the Mobile Client, and the Document
                   Landing Page.]




Search Engine

                   • Processes in a Search Engine
                        • Crawling
                        • Parsing
                        • Indexing
                        • Searching


Search Engine
                   • Processes in a Search Engine
                        • Crawling
                        • Parsing
                        • Duplicate Content Detection
                        • Document Enhancement
                        • Indexing
                        • Searching
                        • Document Serving
Search Engine

                   • Crawling
                        • Collecting Data
                        • Input: the data content to be searched
                        • Output: raw content in its
                          original format
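A minimal sketch of a file system crawler in this sense: it walks a directory tree and returns every file's bytes untouched, leaving interpretation to the parsing step. The function name `crawl_file_system` and the sample files are assumptions made for illustration.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def crawl_file_system(root: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (path, raw bytes) for every regular file under root."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # The content is returned in its original format, not decoded.
            yield str(path), path.read_bytes()


# Build a small throwaway tree so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("hello")
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "b.html").write_text("<title>Hi</title>")
    crawled = list(crawl_file_system(tmp))

for path, raw in crawled:
    print(path, len(raw), "bytes")
```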



Search Engine
                   • Crawling

                   [Figure: File System → File System Crawler; 3rd party apps → Crawler API;
                   Database → Database Crawler. All three produce raw content
                   (PDF, Text, HTML, Document, Image, ...).]




Search Engine
                   • Parsing
                        • Extracts elements from
                          crawled documents
                        • Input: raw content
                        • Output: structured text
                          documents
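As an illustration of the parsing step for HTML input, the sketch below uses Python's standard `html.parser` module to pull a few fields out of raw bytes. Mapping the `author`, `description`, and `date` meta tags to the author, summary, and datetime fields is an assumption for this example; real pages vary widely.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect the title and a few <meta> values from an HTML page."""

    # Assumed mapping from <meta name=...> to document field.
    META_TO_FIELD = {"author": "author", "description": "summary", "date": "datetime"}

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "summary": "", "author": "", "datetime": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            field = self.META_TO_FIELD.get((attrs.get("name") or "").lower())
            if field:
                self.fields[field] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] += data.strip()


def parse_html(raw: bytes) -> dict:
    """Turn raw HTML bytes into a structured text document."""
    parser = FieldExtractor()
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()
    return parser.fields


raw = (b'<html><head><title>Search Engine</title>'
       b'<meta name="author" content="Andi">'
       b'<meta name="description" content="How to make it">'
       b'</head><body>...</body></html>')
doc = parse_html(raw)
print(doc)
```

A PDF or plain-text parser would expose the same output shape, so later stages do not need to know the original format.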


Search Engine
                   • Parsing


                   [Figure: raw content (PDF, Text, HTML, Document, Image, ...) →
                   Text Parser, HTML Parser, PDF Parser →
                   Documents (title, summary, author, datetime).]




Search Engine

                   • Duplicate Content Detection
                        • More data means more
                          duplicated data
                        • Search engines implement similar-
                          document detection



Search Engine
                   • Document Representation
                             Model: Term Frequency (Tf)
                             Example:
                              Document 1 (d1) = "andi likes to watch movie. His wife likes it too"

                              Document 2 (d2) = "andi also likes to watch soccer game."
                              Dictionary = {1:andi, 2:likes, 3:watch, 4:movie, 5:wife, 6:too, 7:soccer}


                              Document representation in the Tf model:
                              d1 = {1, 2, 1, 1, 1, 1, 0}
                              d2 = {1, 1, 1, 0, 0, 0, 1}

Search Engine
                   • Document Similarity
                             Similarity between documents d1 and d2: S(d1, d2)

                             S(d1, d2) = |d1 - d2| (the sum of the absolute differences of the term counts)
                             Example:
                             d1 = {1, 2, 1, 1, 1, 1, 0}

                             d2 = {1, 1, 1, 0, 0, 0, 1}

                             S(d1, d2) = |1-1| + |2-1| + |1-1| + |1-0| + |1-0| + |1-0| + |0-1|

                             S(d1, d2) = 5

                             With this definition, a smaller value of S means the two documents
                             are more similar.
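A small sketch of the Tf representation and of S, computed directly from the two example sentences and the seven-term dictionary. The tokenizer (lowercase runs of letters) and the function names are assumptions; the slides do not say how the text is split into terms.

```python
import re
from collections import Counter

# The dictionary from the example, in index order 1..7.
DICTIONARY = ["andi", "likes", "watch", "movie", "wife", "too", "soccer"]


def tf_vector(text, dictionary=DICTIONARY):
    """Count how often each dictionary term occurs in text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in dictionary]


def s(v1, v2):
    """S(d1, d2): sum of absolute differences between two Tf vectors."""
    return sum(abs(a - b) for a, b in zip(v1, v2))


d1 = tf_vector("andi likes to watch movie. His wife likes it too")
d2 = tf_vector("andi also likes to watch soccer game.")
print("d1 =", d1)
print("d2 =", d2)
print("S(d1, d2) =", s(d1, d2))
```

Because S is a distance rather than a score, identical vectors give 0 and larger values mean less similar documents.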

Search Engine
                   • Algorithm
                             1. Compute Tf for every document

                             2. For a document d, find the smallest value of S(d, di)
                             over all other documents di in the collection to get the
                             document most similar to d
                             3. If S(d, di) < threshold, compare the creation dates
                             of d and di, then
                             erase the older document
                             4. Repeat steps 2 and 3 until no value
                             of S is less than the threshold
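The steps above can be sketched as below. This is a simplified variant, not necessarily the presenter's implementation: each round it looks for the closest pair in the whole collection rather than starting from one document d, and it compares all pairs, which is quadratic and only suitable for small collections. The `Doc` class, its field names, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Doc:
    doc_id: str
    tf: List[int]   # Tf vector from step 1
    created: int    # creation timestamp; larger means newer


def s(v1, v2):
    """Sum of absolute differences between two Tf vectors."""
    return sum(abs(a - b) for a, b in zip(v1, v2))


def deduplicate(docs: List[Doc], threshold: int) -> List[Doc]:
    """Repeatedly drop the older document of the closest pair below threshold."""
    docs = list(docs)
    while True:
        best: Optional[Tuple[int, int, int]] = None  # (distance, i, j)
        for i in range(len(docs)):
            for j in range(i + 1, len(docs)):
                dist = s(docs[i].tf, docs[j].tf)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        # Step 4: stop once no pair is closer than the threshold.
        if best is None or best[0] >= threshold:
            return docs
        _, i, j = best
        # Step 3: erase the older of the two near-duplicates.
        older = i if docs[i].created <= docs[j].created else j
        del docs[older]


collection = [
    Doc("a", [1, 2, 1, 1, 1, 1, 0], created=1),
    Doc("b", [1, 2, 1, 1, 1, 1, 1], created=2),  # near-duplicate of "a"
    Doc("c", [1, 1, 1, 0, 0, 0, 1], created=3),
]
kept = deduplicate(collection, threshold=2)
print([d.doc_id for d in kept])
```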


Search Engine


                   • Document Enhancement
                        • Tag documents based on a taxonomy
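One simple way to tag by taxonomy is keyword matching, sketched below. The taxonomy contents, the `enhance` function, and the choice to match only on title and summary are all assumptions for illustration; the slides do not describe how tags are assigned.

```python
import re

# Hypothetical taxonomy: category -> terms that signal it.
TAXONOMY = {
    "sports": {"soccer", "game", "football"},
    "entertainment": {"movie", "film", "watch"},
}


def enhance(document, taxonomy=TAXONOMY):
    """Return a copy of document with a sorted 'categories' list added."""
    text = f"{document.get('title', '')} {document.get('summary', '')}"
    words = set(re.findall(r"[a-z]+", text.lower()))
    categories = sorted(cat for cat, terms in taxonomy.items() if words & terms)
    return {**document, "categories": categories}


parsed = {
    "title": "Weekend plans",
    "summary": "andi also likes to watch soccer game.",
    "author": "andi",
    "datetime": "",
}
enhanced = enhance(parsed)
print(enhanced["categories"])
```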




Search Engine
                   • Document Enhancement



                   [Figure: Documents (title, summary, author, datetime) →
                   Document Enhancing → Documents (Categorized, Taxonomized).]




Search Engine
                   • Indexing
                        • Indexes all the information
                          that has been gathered for each
                          document
                             • Makes searching faster
                             • Allows searching on specific fields
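A minimal sketch of a fielded inverted index, the structure that usually provides both properties: a lookup touches only the posting sets of the query terms, and keying postings by (field, term) allows field-specific search. The class name, the stop-word list, and the AND-only query semantics are assumptions for this example and are not how Lucene is implemented.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list standing in for a stop analyzer.
STOP_WORDS = {"a", "also", "his", "it", "the", "to"}


def analyze(text):
    """Lowercase, split into terms, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


class FieldedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> set of doc ids

    def add(self, doc_id, document):
        for field, value in document.items():
            for term in analyze(str(value)):
                self.postings[(field, term)].add(doc_id)

    def search(self, field, query):
        """Return ids of documents whose field contains every query term."""
        terms = analyze(query)
        if not terms:
            return set()
        result = set(self.postings.get((field, terms[0]), set()))
        for term in terms[1:]:
            result &= self.postings.get((field, term), set())
        return result


index = FieldedIndex()
index.add(1, {"title": "Movie night", "summary": "andi likes to watch movie", "author": "andi"})
index.add(2, {"title": "Soccer game", "summary": "andi also likes to watch soccer game", "author": "budi"})
print(index.search("summary", "watch soccer"))
print(index.search("author", "andi"))
```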


Search Engine
                   • Indexing
                   [Figure: Documents (Categorized, Taxonomized) → Indexer, supported by
                   a Language Analyzer and a Stop Analyzer → Index.]
Search Engine
                   • Searching



                   [Figure: Index → Index Searcher → Web Client, Mobile Client.]




Search Engine

                   • Document Serving
                        • The search engine is also responsible for
                          displaying results




Search Engine


                   [Figure: components shown: Index, Index Searcher, Web Client,
                   Mobile Client, a second Index Searcher, and the Document Landing Page.]




Search Engine
                   • Recommended Open Source
                     Technology
                             • Search Engine : Lucene, Nutch

                             • Programming Library : Hadoop, Scala Actor

                             • Database : MongoDB, PostgreSQL

                             • Programming Language : Java, Scala, PHP




Thank You



Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 

How To Measure Search Quality

  • 1. Search Engine: How To Make It. Wednesday, December 12, 2012
  • 2. Search Engine: Search Quality Measurement. Diagram: within the set of all documents, the retrieved documents (RET) and the relevant documents (REL) overlap in RET ∩ REL. Database search: low recall, high precision. Web search: high recall, low precision.
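The slide names precision and recall without defining them. Under the standard definitions, precision is |RET ∩ REL| / |RET| and recall is |RET ∩ REL| / |REL|. The following is a minimal sketch of that calculation; the function name and the document ids are hypothetical and chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # RET ∩ REL
    # Guard against empty sets so the ratios are always defined.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
RET = {"doc1", "doc2", "doc3", "doc4"}  # what the engine returned
REL = {"doc2", "doc4", "doc5"}          # what the user actually needed

precision, recall = precision_recall(RET, REL)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were retrieved, so returning more documents would tend to raise recall at the cost of precision.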
  • 3. Search Engine: architecture diagram. Sources (file system, 3rd-party apps, database) feed the File System Crawler, Crawler API, and Database Crawler, which produce raw files (PDF, text, HTML, image, ...). The Text Parser, HTML Parser, and PDF Parser turn these into documents (title, summary, author, datetime). Document Enhancing produces categorized, taxonomized documents. The Indexer, with a Language Analyzer and a Stop Analyzer, builds the Index. The Index Searcher, Document Index Searcher, and Landing Page serve the Web Client and Mobile Client.
  • 4. Search Engine: Process in a Search Engine. Crawling; Parsing; Indexing; Searching.
  • 5. Search Engine: Process in a Search Engine. Crawling; Parsing; Duplicate Content Detection; Document Enhancement; Indexing; Searching; Document Serving.
  • 6. Search Engine: Crawling. Collecting data. Input: the data content to be searched. Output: raw content data in its original format.
  • 7. Search Engine: Crawling (diagram). The File System Crawler reads the file system, the Crawler API receives content from 3rd-party apps, and the Database Crawler reads the database. All three produce raw files (PDF, text, HTML, image, ...).
  • 8. Search Engine: Parsing. The process of extracting elements from crawled documents. Input: raw content. Output: structured textual documents.
  • 9. Search Engine: Parsing (diagram). Raw files (PDF, text, HTML, image, ...) pass through the Text Parser, HTML Parser, and PDF Parser and become documents (title, summary, author, datetime).
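As a rough illustration of the parsing step, the sketch below turns raw HTML into a small structured document using Python's standard `html.parser` module. The class name, the `parse_html` helper, the 80-character summary cut-off, and the sample page are all assumptions made for this example; the slides do not say how the parsers are implemented, and the `datetime` field is left out because the sample page carries no date.

```python
from html.parser import HTMLParser


class SimpleDocParser(HTMLParser):
    """Collect the <title> text, the author <meta> tag, and other text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.author = ""
        self.text_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name") == "author":
                self.author = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            # Simplification: script and style text is not filtered out.
            self.text_parts.append(data.strip())


def parse_html(raw):
    """Turn raw HTML into a dict with title, author, and a short summary."""
    parser = SimpleDocParser()
    parser.feed(raw)
    parser.close()
    body = " ".join(parser.text_parts)
    return {"title": parser.title.strip(),
            "author": parser.author,
            "summary": body[:80]}


# Hypothetical crawled page, for illustration only.
raw_page = ('<html><head><title>Search Engines</title>'
            '<meta name="author" content="Andi"></head>'
            '<body><p>Crawling, parsing, indexing.</p></body></html>')
document = parse_html(raw_page)
print(document)
```

A real pipeline would need one such parser per input format (plain text, PDF, and so on), each emitting the same document structure.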
  • 10. Search Engine: Content Duplication Detection. More data means more duplicated data, so a search engine implements similar-document detection.
  • 11. Search Engine: Document Representation Model: Term Frequency (Tf). Example: Document 1 (d1) = "andi likes to watch movie. His wife likes it too". Document 2 (d2) = "andi also likes to watch soccer game." Dictionary = {1: andi, 2: likes, 3: watch, 4: movie, 5: wife, 6: too, 7: soccer}. Document representation in the Tf model: d1 = {1, 2, 1, 1, 1, 1, 0}, d2 = {1, 1, 1, 0, 0, 0, 1}.
  • 12. Search Engine: Document Similarity. The similarity between documents d1 and d2 is S(d1, d2) = |d1 - d2|, the sum of the absolute differences between their term counts. Example: d1 = {1, 2, 1, 1, 1, 1, 0}, d2 = {1, 1, 1, 0, 0, 0, 1}, so S(d1, d2) = |1-1| + |2-1| + |1-1| + |1-0| + |1-0| + |1-0| + |0-1| = 5. Under this definition, a smaller value means the two documents are more similar.
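The term-frequency model and the similarity measure above can be sketched in a few lines. This is an illustration under stated assumptions, not the author's implementation: the tokenizer (lowercase runs of letters), the function names, and the fixed dictionary order are choices made here. The vectors are counted directly from the two example sentences.

```python
import re
from collections import Counter

# Dictionary from the slide, in the same order (terms 1 to 7).
DICTIONARY = ["andi", "likes", "watch", "movie", "wife", "too", "soccer"]


def tf_vector(text, dictionary=DICTIONARY):
    """Count how often each dictionary term occurs in the text."""
    # Assumed tokenizer: lowercase the text and keep runs of letters.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in dictionary]


def distance(v1, v2):
    """S(d1, d2): sum of absolute differences between term counts."""
    return sum(abs(a - b) for a, b in zip(v1, v2))


d1 = tf_vector("andi likes to watch movie. His wife likes it too")
d2 = tf_vector("andi also likes to watch soccer game.")
print(d1, d2, distance(d1, d2))
```

Because S is a distance, identical documents score 0, and the score grows as the term counts diverge.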
  • 13. Search Engine: Algorithm. 1. Compute Tf for every document. 2. For a document d, find the smallest value of S(d, di) over the whole document collection to get the document most similar to d. 3. If S(d, di) < threshold, compare the creation dates of the two documents and delete the older one. 4. Repeat steps 2 and 3 until no value of S is less than the threshold.
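One way to read the steps above is as a loop that repeatedly finds the closest pair of documents and drops the older one while the pair is closer than the threshold. The sketch below follows that reading. The dictionary keys (`id`, `tf`, `created`), the sample documents, and the threshold value are hypothetical, and the term-frequency vectors are assumed to be computed already (step 1).

```python
from datetime import date


def l1_distance(v1, v2):
    """S(d, di): sum of absolute differences between term counts."""
    return sum(abs(a - b) for a, b in zip(v1, v2))


def remove_near_duplicates(docs, threshold):
    """Drop the older document of every pair closer than the threshold.

    docs is a list of dicts with the hypothetical keys
    'id', 'tf' (term-frequency vector), and 'created' (a date).
    """
    kept = list(docs)
    while True:
        # Step 2: find the most similar pair (smallest distance).
        best = None
        for i in range(len(kept)):
            for j in range(i + 1, len(kept)):
                s = l1_distance(kept[i]["tf"], kept[j]["tf"])
                if best is None or s < best[0]:
                    best = (s, i, j)
        # Step 4: stop when no pair is closer than the threshold.
        if best is None or best[0] >= threshold:
            return kept
        # Step 3: compare creation dates and delete the older document.
        _, i, j = best
        older = i if kept[i]["created"] <= kept[j]["created"] else j
        del kept[older]


# Hypothetical collection, for illustration only.
collection = [
    {"id": "a", "tf": [1, 2, 1, 1, 1, 1, 0], "created": date(2012, 1, 1)},
    {"id": "b", "tf": [1, 2, 1, 1, 1, 0, 0], "created": date(2012, 6, 1)},
    {"id": "c", "tf": [1, 1, 1, 0, 0, 0, 1], "created": date(2012, 3, 1)},
]
survivors = remove_near_duplicates(collection, threshold=3)
print([doc["id"] for doc in survivors])
```

Comparing every pair on every pass is quadratic in the number of documents, so this form is only practical for small collections; a large crawl would need a cheaper way to find candidate pairs.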
  • 14. Search Engine: Document Enhancement. Tag documents based on a taxonomy.
  • 15. Search Engine: Document Enhancement (diagram). Documents (title, summary, author, datetime) pass through Document Enhancing and become categorized, taxonomized documents.
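The slides do not say how tags are assigned. As one possible approach, the sketch below matches keywords against a tiny hand-written taxonomy. The taxonomy contents, the `enhance` function, and the `categories` field are all assumptions made for this example.

```python
import re

# Hypothetical taxonomy: category name -> keywords that trigger it.
TAXONOMY = {
    "sports": {"soccer", "game"},
    "entertainment": {"movie", "film"},
}


def enhance(document, taxonomy=TAXONOMY):
    """Return a copy of the document with a sorted 'categories' list."""
    text = document["title"] + " " + document["summary"]
    words = set(re.findall(r"[a-z]+", text.lower()))
    # A category applies when at least one of its keywords appears.
    tags = sorted(cat for cat, keywords in taxonomy.items()
                  if words & keywords)
    return {**document, "categories": tags}


# Hypothetical parsed document, for illustration only.
parsed = {"title": "Weekend", "summary": "andi also likes to watch soccer game."}
enhanced = enhance(parsed)
print(enhanced["categories"])
```

Keyword matching is the simplest option; a production system might use a trained classifier instead, but the input and output shapes would stay the same.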
  • 16. Search Engine: Indexing. Indexes all the information gathered in each document. Makes searching faster. Makes it possible to search on a specific field.
  • 17. Search Engine: Indexing (diagram). Categorized, taxonomized documents go to the Indexer, which uses a Language Analyzer and a Stop Analyzer to build the Index.
  • 18. Search Engine: Searching (diagram). The Web Client and Mobile Client query the Index Searcher, which reads the Index.
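The indexing and searching steps can be illustrated with a small in-memory inverted index. This is a toy sketch, not how Lucene works internally: the stop-word list, the function names, and the sample documents are assumptions, and a query here matches only documents that contain every query term in the chosen field.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; a stop analyzer drops such terms.
STOP_WORDS = {"to", "it", "his", "also", "the", "a"}


def analyze(text):
    """Lowercase, split into words, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOP_WORDS]


def build_index(documents):
    """Map (field, term) to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, fields in documents.items():
        for field, text in fields.items():
            for term in analyze(text):
                index[(field, term)].add(doc_id)
    return index


def search(index, field, query):
    """Return ids of documents whose field contains every query term."""
    terms = analyze(query)
    if not terms:
        return set()
    postings = [index.get((field, term), set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical documents, for illustration only.
documents = {
    "d1": {"title": "Movie night",
           "summary": "andi likes to watch movie. His wife likes it too"},
    "d2": {"title": "Soccer",
           "summary": "andi also likes to watch soccer game."},
}
index = build_index(documents)
print(search(index, "summary", "watch soccer"))
print(search(index, "title", "movie"))
```

Keying the index on the field as well as the term is what allows a search restricted to one field, and looking up a term is a dictionary access rather than a scan over every document, which is where the speed-up comes from.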
  • 19. Search Engine: Document Serving. The search engine is also responsible for displaying results.
  • 20. Search Engine: serving diagram. Components shown: Web Client, Mobile Client, Index, Index Searcher, Document Index Searcher, Landing Page.
  • 21. Search Engine: Recommended Open Source Technology. Search engine: Lucene, Nutch. Programming library: Hadoop, Scala Actor. Database: MongoDB, PostgreSQL. Programming language: Java, Scala, PHP.