SlideShare uma empresa Scribd logo
1 de 30
Exploiting Web Search Engines to Search Structured Databases Author:Sanjay Agrawal, Kaushik Chakrabarti, Dong Xin Publication:WWW 2009 MADRID Presenter: Nita Pawar
1. INTRODUCTION  			2. ARCHITECTURAL OVERVIEW 			3. ENTITY EXTRACTION 			4. ENTITY RANKING 			5. EXPERIMENTAL EVALUATION 			6. CONCLUSION  OUTLINE
1.INTRODUCTION 1] Queris against search engines . 2]  Query is a structured database. 3] Each structured database is searched individually and the relevant structured data items are returned to the web search engine. The search engine gathers the structured search results and displays them along side the web search results. 4]The result from the structured database search is independent of the result from the web search this is called silo-ed search.This approach works well for some queries. 5] However, there is a broad class of queries where this approach would return incomplete or even empty results. 6] The documents returned by a web search engine are likely to mention products that are relevant to the user query. 7] In this paper the evidence each individual returned document provides about the products mentioned in it and then aggregating" those evidences across the returned documents.
8] Here we also study the task of efficiently and accurately identifying relevant information in a structured database for a web search query.
9]  In this approach can return entity results for a much wider range of queries compared to silo-ed search. 10] To implement this approach there are 2 main design main goal (i) be effective for a wide variety of structured data domains,  (ii) be integrated with a search engine and to exploit it effectively.
[object Object],1] The web documents relevant to the query would typically mention the relevant entities. Eg. Figure 1. 2. ARCHITECTURAL OVERVIEW 2.1 Key Insights
2.2 Architecture
[object Object],(i) The crawlers crawl (ii) The indexers process (iii) The searchers (iv) The front end ,[object Object],1] Entity Extraction: The entity extraction component takes a text document as input, analyzes the document and outputs all the mentions of entities from the given structured database mentioned in the document. 2] Entity Retrieval: The task of entity retrieval is to lookup the DocIndex for a given document identifier and retrieve the entities extracted from the document along with their positions.  3] Entity Ranker:  The set of all entities returned by the entity retrieval component.
3. ENTITY EXTRACTION The extraction into two steps, 3.1]Lookup driven extraction 3.2]Entity mention classification
        3.1 Lookup-Driven Extraction Definition 1. (Extraction from an input document) An extraction from a document d with respect to the reference set Є is the set {m1, ….,mn} of all candidate mentions in d of the entities in Є. Example 1. Consider the reference entity table and the documents in Figure The extraction from d1 and d2 with respect to the reference table is  {(d1, “Xbox 360”, 2), (d3,“PlayStation 3”, 1}.
 A] Approximate Match:   1]Consider the set of consumers and electronics devices. In many documents, users may just refer to the entity  i] Canon EOS Digital Rebel XTi SLR Camera       by writing anon XTI", or OS ii] XTI SLR" and to the entity Sony Vaio F150 Laptop       by writing aio F150" or ony F150".  2]Some times these documents match exactly with entity names in the    reference table would cause these product mentions to be missed.  3]So it is very important to consider approximate matches between sub-strings in a document and an entity name in a reference table
B] Memory Requirement:  We assume that the trie  over all entities in the reference set fits in main memory. e.g. The number of entities (product names) runs up to 10 million, we observed that the memory consumption of the trie is still less than 100 MB. For this e.g where the of the trie may be larger than that available, we can consider sophisticated data structures such as compressed full text indices .also we consider approximate lookup structures such as bloom filters
1] Here we consider two mention   a) true mentions.   b) false mentions. 2] E.g. the string “Pi" can refer to either the movie or the mathematical constant. Similarly, an occurrence of the string “Pretty Woman "may or may not refer to the lm of the same name. 3] Here Є is a set of entities from known entity classes such as products, actors and movies. So determining  whether a candidate mention refers an entity e it is a true mention of e, otherwise it is a false mention. 3.2 Entity Mention Classification
We now discuss two important issues for training and applying these classifiers. a] Generating training data: 1] Here e.g. Wikipedia.    (e.g., http://en.wikipedia.org/wiki/-List_of_painters_by_name). 2] If the candidate mention is linked to the Wikipedia page of the entity, then it is a true mention. If it is linked to a different page within Wikipedia, it is a false mention. b] Feature sets: We use the following 4 types of features:
1] Document Level Features:  For example, movies tend to occur in documents about entertainment, and writers in documents about literature. To implement this , we trained the classifier using the 30K n-gram features occurring most frequently in our training set. 2] Entity Context Features:  By using this feature we can increase the accuracy of our classifiers. 3] Entity Token Frequency Features:  It  is an important feature for entity mention classification. e.g. For example, the movie title `The Others' is a common English phrase, likely to appear in documents without referring to the movie of the same name. 4] Entity Token Length Features:      Short strings and strings with few words not refer to entities than longer   ones. Therefore, we include the number of words in an entity string and its length in characters as features.
The task of the entity ranker is to rank the set of entities. 1.Aggregation: The higher the number of mentions, the more relevant the entity. 2. Proximity: An entity mentioned near the query keywords is likely to be more relevant to the query than an entity mentioned. 3. Document importance: An entity mentioned in a document with higher relevance to the query and/or higher static rank is likely to be more relevant than one mentioned in a less relevant and/or lower static rank document. 4. ENTITY RANKING
 D :set of top N documents returned by  the search engine.   De :set of documents mentioning an entity e.  imp(d) :the importance of a document d, and score(d,e) : e's score from the document d. If an entity is not mentioned in any document, its score is 0 by this definition Importance of a Document imp(d):Partition the list of top N documents into ranges of equal size B. where                 each document d, we compute the range. range(d) the document d is in. range(d) is d[            ]   where rank(d) denotes the document's rank among the web search results. we done the importance imp(d) of d to be
Score from a Document score(d,e): The score score(d, e) of an entity e gets from a single document d depends on     (i) whether e is mentioned in the document d   (ii) on the proximity between the query keywords and the mentions of       e in the document d. We define the score(d, e) as follows: score(d,e)  = 0, if e is not mentioned in d                  = wa, if e occurs in snippet                  = wb, otherwise where wa and wb are two constants.  wa :The score an entity gets from a document when it occurs in the snippet,   wb: The score when it does not.
5. EXPERIMENTAL EVALUATION The results of an extensive empirical study to evaluate the approach proposed in this paper. Superiority over silo-ed search: entity search approach overcomes the limitation of silo-ed search over entity database. Effectiveness of entity ranking: ranking based on aggregation and query term proximity produces high quality results. Low overhead: lower overhead on the query time compared to other architectures proposed for entity search. Accuracy of entity extraction: accurate in identifying the true entity mentions in web pages.
1]To Compare our integrated entity search approach with silo-ed search on a product database containing names and descriptions of187,606 Consumer and Electronics products. 2]Database by storing the product information in a table in Microsoft SQL Server 2008 (one row per product). 3]Column using SQL Server Integrated Full Text Search engine (iFTS) 4]Given a keyword query, we issue the corresponding AND query on that column using iFTS and return the results. 5.1 Comparison with Silo-ed Search
The number of results returned by the two approaches for the 316 queries. Integrated entity search returned answers for most queries (for 98.4% of the queries) where as silo-ed search often returned empty results (for about 36.4% of the queries).
5.2 Entity Ranking To evaluate the quality of our entity ranking on 3 datasets: 1] TREC Question Answering benchmark: 1.factoid questions from the TREC Question Answering track for years 1999, 2000 and 2001.
2] IMDB Data: 1.For example, for the movie “The Godfather", the associated people include Francis Ford Coppola, Mario Puzo, Marlon Brando, etc. We treat the movie name as the query and the associated people as ground truth for the query. Used 250 such queries, Average16.3 answers per query. Figure  shows the precision-recall trade off for our integrated entity search technique. vary the precision-recall trade off by varying the number k of top ranked answers returned by the integrated entity search approach (K = 1, 3, 5, 10, 20 and 30).
Using the snippet (i.e., taking proximity into account) improves the quality significantly: the precision improved from 67% to 86%, from 70% to 81% and from 71% to 79% for K = 1; 3 and 5 respectively. MRR improves from 0.8 to 0.91. 3] Wikipedia Data:             Extracted 115 queries and their ground truth from Wikipedia.
Figure  shows the precision-recall tradeoff for our integrated entity search technique for various values of K (K = 1, 3,5, 10, 20 and 30). The integrated entity search approach produces high quality results for this collection as well: the precision is 61% and 42% for the top 1 and 3 answers respectively. Again, using the snippets signicantly improves quality: the precision improved from 54% to 61% and from 35% to 42% for k = 1 and 3 respectively. Furthermore, the MRR improves from 0.67 to 0.74.
5.3 Overhead of our Approach Query Time Overhead: The overhead is less than 0.1 milliseconds which is negligible compared to end-to-end query times of web search engines .We observe that even for the commerce queries on the product database where the aggregation and ranking costs are relatively high the overhead remains below 0.1 ms.
Extraction Overhead:      The extraction time consists of 3 components:    1)the time to scan and decompress the crawled documents.    2)The time to identify all candidate mentions using lookup-driven extraction    3)the time to perform entity mention classification. Measured the above times for a web sample of 30310 documents (250MB in document text). The times for each of the individual components are 18306 ms, 51495 ms and 20723 ms respectively. The total extraction time is the sum of the above three times: 90524 m.  the average extraction overhead at less than 3ms per web documents.
1]Author use a linear Support Vector Machine (SVM) trained using Sequential Minimal Optimization as the classifier. 2]Measured the accuracy of the classier with regards to differentiating true mentions from the false ones among the set of all candidates. 3]54.2% of all entity candidates in the test data are true mentions. 5.4 Accuracy of Entity Classification
1. Establishing and exploiting relationships between web search results and structured entity databases signicantly enhances the effectiveness of search on these structured databases. 2. An architecture, existing search engine components in order to efficiently implement the integrated entity search functionality. 3. Very little space and time overheads to current search engines, high quality results from the give structured databases. 6.Conclusion
THANK  YOU

Mais conteúdo relacionado

Mais procurados

International Journal of Computational Engineering Research(IJCER)
 International Journal of Computational Engineering Research(IJCER)  International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER) ijceronline
 
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...IRJET Journal
 
Email Data Cleaning
Email Data CleaningEmail Data Cleaning
Email Data Cleaningfeiwin
 
Ijarcet vol-2-issue-4-1357-1362
Ijarcet vol-2-issue-4-1357-1362Ijarcet vol-2-issue-4-1357-1362
Ijarcet vol-2-issue-4-1357-1362Editor IJARCET
 
Context-Based Diversification for Keyword Queries over XML Data
Context-Based Diversification for Keyword Queries over XML DataContext-Based Diversification for Keyword Queries over XML Data
Context-Based Diversification for Keyword Queries over XML Data1crore projects
 
Context based Document Indexing and Retrieval using Big Data Analytics - A Re...
Context based Document Indexing and Retrieval using Big Data Analytics - A Re...Context based Document Indexing and Retrieval using Big Data Analytics - A Re...
Context based Document Indexing and Retrieval using Big Data Analytics - A Re...rahulmonikasharma
 
Scalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large FoldersScalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large Foldersfeiwin
 
Automatic document clustering
Automatic document clusteringAutomatic document clustering
Automatic document clusteringIAEME Publication
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articlesijma
 
Udd for multiple web databases
Udd for multiple web databasesUdd for multiple web databases
Udd for multiple web databasessabhadakwan
 
Feature Extraction for Large-Scale Text Collections
Feature Extraction for Large-Scale Text CollectionsFeature Extraction for Large-Scale Text Collections
Feature Extraction for Large-Scale Text CollectionsSease
 

Mais procurados (16)

International Journal of Computational Engineering Research(IJCER)
 International Journal of Computational Engineering Research(IJCER)  International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
 
Bi4101343346
Bi4101343346Bi4101343346
Bi4101343346
 
Ir 02
Ir   02Ir   02
Ir 02
 
Focused Crawling System based on Improved LSI
Focused Crawling System based on Improved LSIFocused Crawling System based on Improved LSI
Focused Crawling System based on Improved LSI
 
Email Data Cleaning
Email Data CleaningEmail Data Cleaning
Email Data Cleaning
 
Ijarcet vol-2-issue-4-1357-1362
Ijarcet vol-2-issue-4-1357-1362Ijarcet vol-2-issue-4-1357-1362
Ijarcet vol-2-issue-4-1357-1362
 
Context-Based Diversification for Keyword Queries over XML Data
Context-Based Diversification for Keyword Queries over XML DataContext-Based Diversification for Keyword Queries over XML Data
Context-Based Diversification for Keyword Queries over XML Data
 
LeVan, "Search Web Services"
LeVan, "Search Web Services"LeVan, "Search Web Services"
LeVan, "Search Web Services"
 
Context based Document Indexing and Retrieval using Big Data Analytics - A Re...
Context based Document Indexing and Retrieval using Big Data Analytics - A Re...Context based Document Indexing and Retrieval using Big Data Analytics - A Re...
Context based Document Indexing and Retrieval using Big Data Analytics - A Re...
 
Scalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large FoldersScalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large Folders
 
Automatic document clustering
Automatic document clusteringAutomatic document clustering
Automatic document clustering
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articles
 
Udd for multiple web databases
Udd for multiple web databasesUdd for multiple web databases
Udd for multiple web databases
 
Feature Extraction for Large-Scale Text Collections
Feature Extraction for Large-Scale Text CollectionsFeature Extraction for Large-Scale Text Collections
Feature Extraction for Large-Scale Text Collections
 
In3415791583
In3415791583In3415791583
In3415791583
 

Destaque

Budowa komputera (1)
Budowa komputera (1)Budowa komputera (1)
Budowa komputera (1)dariusz1235
 
Budowa komputera (1)
Budowa komputera (1)Budowa komputera (1)
Budowa komputera (1)dariusz1235
 
Budowa komputera
Budowa komputera Budowa komputera
Budowa komputera dariusz1235
 
Decoding A Culture: Beginning Hang Gliders
Decoding A Culture: Beginning Hang GlidersDecoding A Culture: Beginning Hang Gliders
Decoding A Culture: Beginning Hang GlidersAlexis Cristine Walden
 
Nghien cuu-chuan-ket-noi-khong-day-zigbeeieee-80215
Nghien cuu-chuan-ket-noi-khong-day-zigbeeieee-80215Nghien cuu-chuan-ket-noi-khong-day-zigbeeieee-80215
Nghien cuu-chuan-ket-noi-khong-day-zigbeeieee-80215Trần Danh Nam
 
Health & safety presentation
Health & safety presentationHealth & safety presentation
Health & safety presentationPhalinng
 
Удаленные юзабилити-тестирования
Удаленные юзабилити-тестированияУдаленные юзабилити-тестирования
Удаленные юзабилити-тестированияMaria Vasilieva
 
Success Starts With Articles
Success Starts With ArticlesSuccess Starts With Articles
Success Starts With ArticlesBryan Johnson
 
2 F Test
2 F Test2 F Test
2 F Test2FRESH
 

Destaque (17)

Budowa komputera (1)
Budowa komputera (1)Budowa komputera (1)
Budowa komputera (1)
 
Teaching 1
Teaching 1Teaching 1
Teaching 1
 
I Love Africa
I Love  AfricaI Love  Africa
I Love Africa
 
流動的世界 Part 2
流動的世界 Part 2流動的世界 Part 2
流動的世界 Part 2
 
11 litopad
11 litopad11 litopad
11 litopad
 
Budowa komputera (1)
Budowa komputera (1)Budowa komputera (1)
Budowa komputera (1)
 
流動的世界 part1
流動的世界 part1流動的世界 part1
流動的世界 part1
 
Budowa komputera
Budowa komputera Budowa komputera
Budowa komputera
 
Decoding A Culture: Beginning Hang Gliders
Decoding A Culture: Beginning Hang GlidersDecoding A Culture: Beginning Hang Gliders
Decoding A Culture: Beginning Hang Gliders
 
Nghien cuu-chuan-ket-noi-khong-day-zigbeeieee-80215
Nghien cuu-chuan-ket-noi-khong-day-zigbeeieee-80215Nghien cuu-chuan-ket-noi-khong-day-zigbeeieee-80215
Nghien cuu-chuan-ket-noi-khong-day-zigbeeieee-80215
 
Teaching 1
Teaching 1Teaching 1
Teaching 1
 
Health & safety presentation
Health & safety presentationHealth & safety presentation
Health & safety presentation
 
Удаленные юзабилити-тестирования
Удаленные юзабилити-тестированияУдаленные юзабилити-тестирования
Удаленные юзабилити-тестирования
 
Success Starts With Articles
Success Starts With ArticlesSuccess Starts With Articles
Success Starts With Articles
 
Hr Professional
Hr ProfessionalHr Professional
Hr Professional
 
Photo Bar
Photo BarPhoto Bar
Photo Bar
 
2 F Test
2 F Test2 F Test
2 F Test
 

Semelhante a Exploiting web search engines to search structured

IRJET- On-AIR Based Information Retrieval System for Semi-Structure Data
IRJET-  	  On-AIR Based Information Retrieval System for Semi-Structure DataIRJET-  	  On-AIR Based Information Retrieval System for Semi-Structure Data
IRJET- On-AIR Based Information Retrieval System for Semi-Structure DataIRJET Journal
 
Presentation dual inversion-index
Presentation dual inversion-indexPresentation dual inversion-index
Presentation dual inversion-indexmahi_uta
 
software_engg-chap-03.ppt
software_engg-chap-03.pptsoftware_engg-chap-03.ppt
software_engg-chap-03.ppt064ChetanWani
 
VRE Cancer Imaging BL RIC Workshop 22032011
VRE Cancer Imaging BL RIC Workshop 22032011VRE Cancer Imaging BL RIC Workshop 22032011
VRE Cancer Imaging BL RIC Workshop 22032011djmichael156
 
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...Computer Science Journals
 
IST365 - Project Deliverable #3Create the corresponding relation.docx
IST365 - Project Deliverable #3Create the corresponding relation.docxIST365 - Project Deliverable #3Create the corresponding relation.docx
IST365 - Project Deliverable #3Create the corresponding relation.docxpriestmanmable
 
IRJET- Review on Information Retrieval for Desktop Search Engine
IRJET-  	  Review on Information Retrieval for Desktop Search EngineIRJET-  	  Review on Information Retrieval for Desktop Search Engine
IRJET- Review on Information Retrieval for Desktop Search EngineIRJET Journal
 
Car Showroom management System in c++
Car Showroom management System in c++Car Showroom management System in c++
Car Showroom management System in c++PUBLIVE
 
IRJET- Foster Hashtag from Image and Text
IRJET-  	  Foster Hashtag from Image and TextIRJET-  	  Foster Hashtag from Image and Text
IRJET- Foster Hashtag from Image and TextIRJET Journal
 
Murach': HOW TO DEVELOP A DATA DRIVEN MVC WEB
Murach': HOW TO DEVELOP A DATA DRIVEN MVC WEB Murach': HOW TO DEVELOP A DATA DRIVEN MVC WEB
Murach': HOW TO DEVELOP A DATA DRIVEN MVC WEB MahmoudOHassouna
 
Team G
Team GTeam G
Team Gbutest
 
META SEARCH ENGINE WITH AN INTELLIGENT INTERFACE FOR INFORMATION RETRIEVAL ON...
META SEARCH ENGINE WITH AN INTELLIGENT INTERFACE FOR INFORMATION RETRIEVAL ON...META SEARCH ENGINE WITH AN INTELLIGENT INTERFACE FOR INFORMATION RETRIEVAL ON...
META SEARCH ENGINE WITH AN INTELLIGENT INTERFACE FOR INFORMATION RETRIEVAL ON...ijcseit
 
IRJET- Automatic Database Schema Generator
IRJET- Automatic Database Schema GeneratorIRJET- Automatic Database Schema Generator
IRJET- Automatic Database Schema GeneratorIRJET Journal
 
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State UniversityLSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State Universitydhabalia
 
Safeguarding Abila: Discovering Evolving Activist Networks
Safeguarding Abila: Discovering Evolving Activist NetworksSafeguarding Abila: Discovering Evolving Activist Networks
Safeguarding Abila: Discovering Evolving Activist NetworksParang Saraf
 
2.business object repository
2.business object repository2.business object repository
2.business object repositoryAjay Kumar ☁
 

Semelhante a Exploiting web search engines to search structured (20)

IRJET- On-AIR Based Information Retrieval System for Semi-Structure Data
IRJET-  	  On-AIR Based Information Retrieval System for Semi-Structure DataIRJET-  	  On-AIR Based Information Retrieval System for Semi-Structure Data
IRJET- On-AIR Based Information Retrieval System for Semi-Structure Data
 
Presentation dual inversion-index
Presentation dual inversion-indexPresentation dual inversion-index
Presentation dual inversion-index
 
Web clustring engine
Web clustring engineWeb clustring engine
Web clustring engine
 
software_engg-chap-03.ppt
software_engg-chap-03.pptsoftware_engg-chap-03.ppt
software_engg-chap-03.ppt
 
VRE Cancer Imaging BL RIC Workshop 22032011
VRE Cancer Imaging BL RIC Workshop 22032011VRE Cancer Imaging BL RIC Workshop 22032011
VRE Cancer Imaging BL RIC Workshop 22032011
 
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
 
IST365 - Project Deliverable #3Create the corresponding relation.docx
IST365 - Project Deliverable #3Create the corresponding relation.docxIST365 - Project Deliverable #3Create the corresponding relation.docx
IST365 - Project Deliverable #3Create the corresponding relation.docx
 
IRJET- Review on Information Retrieval for Desktop Search Engine
IRJET-  	  Review on Information Retrieval for Desktop Search EngineIRJET-  	  Review on Information Retrieval for Desktop Search Engine
IRJET- Review on Information Retrieval for Desktop Search Engine
 
F04302053057
F04302053057F04302053057
F04302053057
 
E017252831
E017252831E017252831
E017252831
 
Car Showroom management System in c++
Car Showroom management System in c++Car Showroom management System in c++
Car Showroom management System in c++
 
uml.pptx
uml.pptxuml.pptx
uml.pptx
 
IRJET- Foster Hashtag from Image and Text
IRJET-  	  Foster Hashtag from Image and TextIRJET-  	  Foster Hashtag from Image and Text
IRJET- Foster Hashtag from Image and Text
 
Murach': HOW TO DEVELOP A DATA DRIVEN MVC WEB
Murach': HOW TO DEVELOP A DATA DRIVEN MVC WEB Murach': HOW TO DEVELOP A DATA DRIVEN MVC WEB
Murach': HOW TO DEVELOP A DATA DRIVEN MVC WEB
 
Team G
Team GTeam G
Team G
 
META SEARCH ENGINE WITH AN INTELLIGENT INTERFACE FOR INFORMATION RETRIEVAL ON...
META SEARCH ENGINE WITH AN INTELLIGENT INTERFACE FOR INFORMATION RETRIEVAL ON...META SEARCH ENGINE WITH AN INTELLIGENT INTERFACE FOR INFORMATION RETRIEVAL ON...
META SEARCH ENGINE WITH AN INTELLIGENT INTERFACE FOR INFORMATION RETRIEVAL ON...
 
IRJET- Automatic Database Schema Generator
IRJET- Automatic Database Schema GeneratorIRJET- Automatic Database Schema Generator
IRJET- Automatic Database Schema Generator
 
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State UniversityLSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
 
Safeguarding Abila: Discovering Evolving Activist Networks
Safeguarding Abila: Discovering Evolving Activist NetworksSafeguarding Abila: Discovering Evolving Activist Networks
Safeguarding Abila: Discovering Evolving Activist Networks
 
2.business object repository
2.business object repository2.business object repository
2.business object repository
 

Último

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 

Último (20)

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 

Exploiting web search engines to search structured

  • 1. Exploiting Web Search Engines to Search Structured Databases Author:Sanjay Agrawal, Kaushik Chakrabarti, Dong Xin Publication:WWW 2009 MADRID Presenter: Nita Pawar
  • 2. 1. INTRODUCTION 2. ARCHITECTURAL OVERVIEW 3. ENTITY EXTRACTION 4. ENTITY RANKING 5. EXPERIMENTAL EVALUATION 6. CONCLUSION OUTLINE
  • 3. 1.INTRODUCTION 1] Queris against search engines . 2] Query is a structured database. 3] Each structured database is searched individually and the relevant structured data items are returned to the web search engine. The search engine gathers the structured search results and displays them along side the web search results. 4]The result from the structured database search is independent of the result from the web search this is called silo-ed search.This approach works well for some queries. 5] However, there is a broad class of queries where this approach would return incomplete or even empty results. 6] The documents returned by a web search engine are likely to mention products that are relevant to the user query. 7] In this paper the evidence each individual returned document provides about the products mentioned in it and then aggregating" those evidences across the returned documents.
  • 4. 8] Here we also study the task of efficiently and accurately identifying relevant information in a structured database for a web search query.
  • 5. 9] In this approach can return entity results for a much wider range of queries compared to silo-ed search. 10] To implement this approach there are 2 main design main goal (i) be effective for a wide variety of structured data domains, (ii) be integrated with a search engine and to exploit it effectively.
  • 6.
  • 8.
  • 9. 3. ENTITY EXTRACTION The extraction into two steps, 3.1]Lookup driven extraction 3.2]Entity mention classification
  • 10. 3.1 Lookup-Driven Extraction Definition 1. (Extraction from an input document) An extraction from a document d with respect to the reference set Є is the set {m1, ….,mn} of all candidate mentions in d of the entities in Є. Example 1. Consider the reference entity table and the documents in Figure The extraction from d1 and d2 with respect to the reference table is {(d1, “Xbox 360”, 2), (d3,“PlayStation 3”, 1}.
  • 11. A] Approximate Match: 1]Consider the set of consumers and electronics devices. In many documents, users may just refer to the entity i] Canon EOS Digital Rebel XTi SLR Camera by writing anon XTI", or OS ii] XTI SLR" and to the entity Sony Vaio F150 Laptop by writing aio F150" or ony F150". 2]Some times these documents match exactly with entity names in the reference table would cause these product mentions to be missed. 3]So it is very important to consider approximate matches between sub-strings in a document and an entity name in a reference table
  • 12. B] Memory Requirement: We assume that the trie over all entities in the reference set fits in main memory. e.g. The number of entities (product names) runs up to 10 million, we observed that the memory consumption of the trie is still less than 100 MB. For this e.g where the of the trie may be larger than that available, we can consider sophisticated data structures such as compressed full text indices .also we consider approximate lookup structures such as bloom filters
  • 13. 1] Here we consider two mention a) true mentions. b) false mentions. 2] E.g. the string “Pi" can refer to either the movie or the mathematical constant. Similarly, an occurrence of the string “Pretty Woman "may or may not refer to the lm of the same name. 3] Here Є is a set of entities from known entity classes such as products, actors and movies. So determining whether a candidate mention refers an entity e it is a true mention of e, otherwise it is a false mention. 3.2 Entity Mention Classification
  • 14. We now discuss two important issues for training and applying these classifiers. a] Generating training data: 1] Here e.g. Wikipedia. (e.g., http://en.wikipedia.org/wiki/-List_of_painters_by_name). 2] If the candidate mention is linked to the Wikipedia page of the entity, then it is a true mention. If it is linked to a different page within Wikipedia, it is a false mention. b] Feature sets: We use the following 4 types of features:
  • 15. 1] Document Level Features: For example, movies tend to occur in documents about entertainment, and writers in documents about literature. To implement this , we trained the classifier using the 30K n-gram features occurring most frequently in our training set. 2] Entity Context Features: By using this feature we can increase the accuracy of our classifiers. 3] Entity Token Frequency Features: It is an important feature for entity mention classification. e.g. For example, the movie title `The Others' is a common English phrase, likely to appear in documents without referring to the movie of the same name. 4] Entity Token Length Features: Short strings and strings with few words not refer to entities than longer ones. Therefore, we include the number of words in an entity string and its length in characters as features.
  • 16. The task of the entity ranker is to rank the set of entities. 1.Aggregation: The higher the number of mentions, the more relevant the entity. 2. Proximity: An entity mentioned near the query keywords is likely to be more relevant to the query than an entity mentioned. 3. Document importance: An entity mentioned in a document with higher relevance to the query and/or higher static rank is likely to be more relevant than one mentioned in a less relevant and/or lower static rank document. 4. ENTITY RANKING
  • 17. D :set of top N documents returned by the search engine. De :set of documents mentioning an entity e. imp(d) :the importance of a document d, and score(d,e) : e's score from the document d. If an entity is not mentioned in any document, its score is 0 by this definition Importance of a Document imp(d):Partition the list of top N documents into ranges of equal size B. where each document d, we compute the range. range(d) the document d is in. range(d) is d[ ] where rank(d) denotes the document's rank among the web search results. we done the importance imp(d) of d to be
  • 18. Score from a Document score(d,e): The score score(d, e) of an entity e gets from a single document d depends on (i) whether e is mentioned in the document d (ii) on the proximity between the query keywords and the mentions of e in the document d. We define the score(d, e) as follows: score(d,e) = 0, if e is not mentioned in d = wa, if e occurs in snippet = wb, otherwise where wa and wb are two constants. wa :The score an entity gets from a document when it occurs in the snippet, wb: The score when it does not.
  • 19. 5. EXPERIMENTAL EVALUATION The results of an extensive empirical study to evaluate the approach proposed in this paper. Superiority over silo-ed search: entity search approach overcomes the limitation of silo-ed search over entity database. Effectiveness of entity ranking: ranking based on aggregation and query term proximity produces high quality results. Low overhead: lower overhead on the query time compared to other architectures proposed for entity search. Accuracy of entity extraction: accurate in identifying the true entity mentions in web pages.
  • 20. 1]To Compare our integrated entity search approach with silo-ed search on a product database containing names and descriptions of187,606 Consumer and Electronics products. 2]Database by storing the product information in a table in Microsoft SQL Server 2008 (one row per product). 3]Column using SQL Server Integrated Full Text Search engine (iFTS) 4]Given a keyword query, we issue the corresponding AND query on that column using iFTS and return the results. 5.1 Comparison with Silo-ed Search
  • 21. The number of results returned by the two approaches for the 316 queries. Integrated entity search returned answers for most queries (for 98.4% of the queries) where as silo-ed search often returned empty results (for about 36.4% of the queries).
  • 22. 5.2 Entity Ranking To evaluate the quality of our entity ranking on 3 datasets: 1] TREC Question Answering benchmark: 1.factoid questions from the TREC Question Answering track for years 1999, 2000 and 2001.
  • 23. 2] IMDB Data: 1.For example, for the movie “The Godfather", the associated people include Francis Ford Coppola, Mario Puzo, Marlon Brando, etc. We treat the movie name as the query and the associated people as ground truth for the query. Used 250 such queries, Average16.3 answers per query. Figure shows the precision-recall trade off for our integrated entity search technique. vary the precision-recall trade off by varying the number k of top ranked answers returned by the integrated entity search approach (K = 1, 3, 5, 10, 20 and 30).
  • 24. Using the snippet (i.e., taking proximity into account) improves the quality significantly: the precision improved from 67% to 86%, from 70% to 81% and from 71% to 79% for K = 1; 3 and 5 respectively. MRR improves from 0.8 to 0.91. 3] Wikipedia Data: Extracted 115 queries and their ground truth from Wikipedia.
  • 25. Figure shows the precision-recall tradeoff for our integrated entity search technique for various values of K (K = 1, 3,5, 10, 20 and 30). The integrated entity search approach produces high quality results for this collection as well: the precision is 61% and 42% for the top 1 and 3 answers respectively. Again, using the snippets signicantly improves quality: the precision improved from 54% to 61% and from 35% to 42% for k = 1 and 3 respectively. Furthermore, the MRR improves from 0.67 to 0.74.
  • 26. 5.3 Overhead of our Approach Query Time Overhead: The overhead is less than 0.1 milliseconds which is negligible compared to end-to-end query times of web search engines .We observe that even for the commerce queries on the product database where the aggregation and ranking costs are relatively high the overhead remains below 0.1 ms.
  • 27. Extraction Overhead: The extraction time consists of 3 components: 1)the time to scan and decompress the crawled documents. 2)The time to identify all candidate mentions using lookup-driven extraction 3)the time to perform entity mention classification. Measured the above times for a web sample of 30310 documents (250MB in document text). The times for each of the individual components are 18306 ms, 51495 ms and 20723 ms respectively. The total extraction time is the sum of the above three times: 90524 m. the average extraction overhead at less than 3ms per web documents.
  • 28. 1]Author use a linear Support Vector Machine (SVM) trained using Sequential Minimal Optimization as the classifier. 2]Measured the accuracy of the classier with regards to differentiating true mentions from the false ones among the set of all candidates. 3]54.2% of all entity candidates in the test data are true mentions. 5.4 Accuracy of Entity Classification
  • 29. 1. Establishing and exploiting relationships between web search results and structured entity databases signicantly enhances the effectiveness of search on these structured databases. 2. An architecture, existing search engine components in order to efficiently implement the integrated entity search functionality. 3. Very little space and time overheads to current search engines, high quality results from the give structured databases. 6.Conclusion