2. Goals
•Identify important entities within the
advertisements.
•Link them to their corresponding Wikipedia
pages.
•Identify relevant concepts in order to
disambiguate each entity.
3. Benefits of Wikipedia
•An ever-expanding number of pages in the
Wikipedia corpus.
•A rigorous structure, though with low coverage,
which emulates real-world data very well.
•A large number of entities, including proper
names unlikely to be found in any other
collection.
•Redirect pages and disambiguation pages.
4. Process Overview
•Parser Module – Parses the given webpage and
produces two documents: the advertisement
itself and the surrounding document, which is
later used in the final steps to disambiguate the
results of the search module.
•Tokenizer Module – Converts the
advertisement into a list of tokens.
•POS Tagger Module – Marks up each word in an
ad with its particular part of speech.
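The tokenizer step can be sketched with a simple regular-expression tokenizer. This is a minimal stand-in under stated assumptions: a real pipeline would typically use a library tokenizer and POS tagger (e.g. NLTK), and the sample ad text is illustrative.

```python
import re

def tokenize(ad_text):
    """Split an advertisement into word tokens, keeping contractions together."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", ad_text)

tokens = tokenize("An apple a day keeps the doctor away")
# tokens == ['An', 'apple', 'a', 'day', 'keeps', 'the', 'doctor', 'away']
```

The token list produced here is what the POS tagger consumes; each token is then labelled with its part of speech before chunking.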
5. Process Overview
•Parsing Module – Returns the advertisement in
tree format.
•Noun Phrase Extraction Module – Extracts NPs
from the tree generated in the previous step.
•Noun Phrase Ranking – Ranks the NPs using a
heuristic function.
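The NP extraction and ranking steps can be sketched as below. The chunk grammar (an optional determiner, optional adjectives, then one or more nouns) and the ranking heuristic (favouring longer phrases and capitalised words) are illustrative assumptions, not the deck's actual grammar or heuristic function.

```python
def extract_noun_phrases(tagged):
    """Greedy shallow chunker for (DT)? (JJ)* (NN.*)+ runs over POS-tagged tokens."""
    phrases, current, has_noun = [], [], False
    for word, tag in tagged:
        # Close the current chunk when this token cannot extend it.
        if current and (has_noun and not tag.startswith("NN")
                        or not has_noun and not (tag == "JJ" or tag.startswith("NN"))):
            if has_noun:                      # only keep chunks that contain a noun
                phrases.append(" ".join(current))
            current, has_noun = [], False
        if tag == "DT" and not current:
            current = [word]                  # optional leading determiner
        elif tag == "JJ" and not has_noun:
            current.append(word)              # adjectives before the head noun
        elif tag.startswith("NN"):
            current.append(word)
            has_noun = True
    if has_noun:
        phrases.append(" ".join(current))
    return phrases

def rank_noun_phrases(phrases):
    """Toy heuristic: longer phrases and capitalised words score higher."""
    def score(phrase):
        words = phrase.split()
        return len(words) + sum(w[0].isupper() for w in words)
    return sorted(phrases, key=score, reverse=True)
```

For the tagged ad [("An","DT"), ("apple","NN"), ("a","DT"), ("day","NN"), ("keeps","VBZ"), ("the","DT"), ("doctor","NN"), ("away","RB")], the chunker yields "An apple", "a day", and "the doctor", and the heuristic ranks "An apple" first.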
6. Process Overview
•Entity/Keyword Extraction Module – Probable
entities and keywords are extracted from the
highest-ranked NPs.
•Search Module – Returns a list of relevant
documents. The search module is essentially an
inverted index of the Wikipedia dump; only the
titles and summaries of the pages are indexed.
•Filtering of Results – Finds the most likely/
closest wiki page.
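A minimal sketch of the search module's inverted index over page titles and summaries. The two sample pages and the hit-count scoring are illustrative assumptions; the actual module indexes a full Wikipedia dump, and its filtering step may use a richer similarity measure.

```python
import re
from collections import defaultdict

def build_index(pages):
    """pages maps page title -> summary; index each word under the titles it appears in."""
    index = defaultdict(set)
    for title, summary in pages.items():
        for word in re.findall(r"[a-z]+", (title + " " + summary).lower()):
            index[word].add(title)
    return index

def search(index, keywords):
    """Score candidate pages by how many query keywords they contain."""
    scores = defaultdict(int)
    for kw in keywords:
        for title in index.get(kw.lower(), ()):
            scores[title] += 1
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical two-page "dump" standing in for the real Wikipedia index.
pages = {
    "Apple (fruit)": "The pomaceous fruit of the apple tree, eaten fresh or as juice.",
    "Apple Corporation": "An American technology company that designs consumer electronics.",
}
index = build_index(pages)
print(search(index, ["apple", "doctor", "fruit"]))  # ranks Apple (fruit) first
```

Filtering then amounts to picking the top-scoring candidate: keywords drawn from the ad's context ("doctor", "fruit") push the fruit page above the company page.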
8. Entity Detection
•The basic technique for entity detection is chunk
detection via shallow parsing.
•This technique reduces the keywords to be
searched in the corpus, improving performance
and accuracy.
9. Evaluation and Results
•Advertisement: "An apple a day keeps the
doctor away" – Wiki page: Apple (fruit)
•Advertisement: "Apple innovates relentlessly to
make great products, buy an Apple" – Wiki
page: Apple Corporation
•Advertisement: "Royal Stag, it's your life, make it
large" – Wiki page: Royal Stag
10. Conclusions
•It is possible to use NLP techniques to narrow
down the list of words to be searched in the
search engine.
•Context can be extracted from the
advertisement itself using NLP techniques.
•The search module gives satisfactory results on a
simple inverted index built from page titles
and summaries.