CONCEPTS AND CHALLENGES
OF TEXT RETRIEVAL
FOR SEARCH ENGINES
PRE-CONFERENCE TUTORIAL
by Gan Keng Hoon
16th August 2016
1
THIS TUTORIAL
Overview: Text Retrieval & Search Engine
Concept: Basics of Text Retrieval
Challenges: Semantics & Specific
Case: Expert Search Engine
2
Search
3
What Do People Search for?
Fu Yuanhui
How to get free Pokeball?
How to write thesis in three months?
keynote speaker ICAICTA 2016
4
What Do People Expect?
How to get free Pokeball
5
Behind the
Click?
6
Quiz: Which one is not a Search Engine?
7
Types of Search Engine
Web Search Engine
Google, Yahoo, Bing
Domain Specific Search Engine
MEDLINE/PubMed
Microsoft Academic
Desktop Search Engine
Copernic
8
Connecting Two Ends
Search Collection
• Web, Domain Specific, Personal, Enterprise, etc.
• Web sites, journal articles, news, images, videos, audio, scanned documents, tweets, posts, reviews, etc.
Information Needs
• I want to know more about the keynote speeches of ICAICTA 2016.
• I need more Pokeballs, free of charge...
• What's so funny about Fu Yuanhui??
• Scholarship ending soon, three months left to submit my thesis...
9
A Conceptual Model for Text Retrieval
[Diagram labels: Information Needs, Query, Formulation, Search Collection, Indexing, Document Representation, Retrieval Function, Retrieved Documents, Relevance Feedback, Natural Language Content Analysis]
10
Natural Language Content Analysis
11
Search Collection (Retrieval Unit)
Web pages, email, books, news stories, scholarly papers, text messages, Word™, PowerPoint™, PDF, forum postings, patents, etc.
A retrieval unit can be
Part of a document, e.g. a paragraph, a slide, a page, etc.
In different structures, e.g. HTML, XML, plain text, etc.
Of different sizes/lengths.
12
Document Representation
Full Text Representation
Keep everything. Complete.
Requires huge resources. Too much may not be good.
Reduced (partial) Content Representation
Remove unimportant content, e.g. stopwords.
Standardize to reduce overlapping content, e.g. stemming.
Retain only important content, e.g. noun phrases, headers, etc.
13
Document Representation
Think of representation as a way of storing the document.
Bag of Words Model
Store a document as the bag (multiset) of its words, disregarding grammar and even word order.
Document 1: "The cat sat on the hat"
Document 2: "The dog ate the cat and the hat"
From these two documents, a word list is constructed:
{ the, cat, sat, on, hat, dog, ate, and }
The list has 8 distinct words.
Document 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Document 2 : { 3, 1, 0, 0, 1, 1, 1, 1}
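The two count vectors above can be reproduced with a few lines of code. This is a minimal sketch, assuming lowercasing and whitespace splitting as the only text processing; the function name `bag_of_words` is illustrative and not from the slides.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())

docs = ["The cat sat on the hat", "The dog ate the cat and the hat"]
bags = [bag_of_words(d) for d in docs]

# Build the shared word list in order of first appearance.
vocab = []
for d in docs:
    for w in d.lower().split():
        if w not in vocab:
            vocab.append(w)

# One count vector per document, aligned with the word list.
vectors = [[bag[w] for w in vocab] for bag in bags]
print(vocab)
print(vectors)
```

Word order inside each document is lost: only the counts survive, which is the defining property of the bag-of-words model.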
14
Information Needs & Query
Information Needs != Query
Recall the information needs
Query: icaicta 2016 keynote
Information Need: I want to know more about the keynote speeches of ICAICTA 2016
Query: free pokeball
Information Need: I need more Pokeballs. I don’t want to pay. No cheat
codes.
15
Retrieved Documents
From the original collection, a subset of documents is obtained.
What determines which documents to return?
Simple Term Matching Approach
1. Compare the terms in each document with those in the query.
2. Compute the “similarity” between each document in the collection and the query, based on the terms they have in common.
3. Sort the documents in order of decreasing similarity with the query.
4. The output is a ranked list displayed to the user; the top documents are the most relevant as judged by the system.
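The four steps can be sketched as follows. This is only an illustration, assuming "similarity" is the number of distinct terms shared by query and document; the three-document collection and the function names are made up for the example.

```python
def term_overlap_score(query, document):
    """Steps 1-2: count the distinct terms shared by query and document."""
    q_terms = set(query.lower().split())
    d_terms = set(document.lower().split())
    return len(q_terms & d_terms)

def rank(query, collection):
    """Steps 3-4: keep matching documents, sorted by decreasing score."""
    scored = [(term_overlap_score(query, doc), doc) for doc in collection]
    scored = [(s, doc) for s, doc in scored if s > 0]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical mini collection.
collection = [
    "free pokeball giveaway this weekend",
    "how to write a thesis in three months",
    "pokeball prices in the shop",
]
results = rank("free pokeball", collection)
for score, doc in results:
    print(score, doc)
```

Counting shared terms treats every term as equally important; the weighting schemes later in the tutorial refine this.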
16
Indexing
Convert documents into a representation or data structure that improves the efficiency of retrieval.
Generate a set of useful terms called indexes.
Why?
A wide variety of words is used in texts, but not all are important.
Among the important words, some are more contextually relevant.
Some basic processes involved
• Tokenization
• Stop Words Removal
• Stemming
• Phrases
• Inverted File
17
Indexing (Tokenization)
Convert a sequence of characters into a sequence of tokens with some basic meaning.
“The cat chases the mouse.”
→ the | cat | chases | the | mouse
“Bigcorp's 2007 bi-annual report showed profits rose 10%.”
→ bigcorp | 2007 | bi | annual | report | showed | profits | rose | 10%
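A small tokenizer that reproduces the two token sequences above might look like this. It is a sketch under stated assumptions: lowercase everything, drop a possessive 's, split on any other non-alphanumeric character, and keep a trailing percent sign. Real tokenizers make many more decisions.

```python
import re

def tokenize(text):
    """Lowercase, strip possessive 's, and extract alphanumeric tokens."""
    text = text.lower()
    text = re.sub(r"['’]s\b", "", text)      # bigcorp's -> bigcorp
    return re.findall(r"[a-z0-9]+%?", text)  # keeps "10%" as one token

print(tokenize("The cat chases the mouse."))
print(tokenize("Bigcorp's 2007 bi-annual report showed profits rose 10%."))
```

Whatever rules are chosen, the same function should be applied to both documents and queries so that their tokens can match.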
18
Indexing (Tokenization)
A token can be a single term or multiple terms.
“Samsung Galaxy S7 Edge, redefines what a phone can do.”
→ samsung galaxy s7 edge | redefines | what | a | phone | can | do
or
→ samsung | galaxy | s7 | edge | redefines | what | a | …
19
Indexing (Tokenization)
Common Issues
1. Capitalized words can have a different meaning from lowercase words
Bush fires the officer. Query: Bush fire
The bush fire lasted for 3 days. Query: bush fire
2. Apostrophes can be part of a word, part of a possessive, or just a mistake
rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's
20
Indexing (Tokenization)
3. Numbers can be important, including decimals
nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358
4. Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations
I.B.M., Ph.D., cs.umass.edu, F.E.A.R.
Note: the tokenizing steps for queries must be identical to the steps for documents
21
Indexing (Stopping)
Top 50 Words from AP89 News Collection
Recall: indexes should be useful terms that link to a document.
Are the terms in the figure on the right useful?
22
Indexing (Stopping)
A stopword list can be created from high-frequency words or based on a standard list.
Lists are customized for applications, domains, and even parts of documents,
e.g., “click” is a good stopword for anchor text.
Best policy is to index all words in documents and decide which words to use at query time?
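Building a stopword list from high-frequency words can be sketched as below. The cut-off `top_n` and the two-sentence collection are assumptions for illustration; a real list would be derived from a large collection or taken from a standard list.

```python
from collections import Counter

def frequency_stopwords(documents, top_n):
    """Return the top_n most frequent words across the collection."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return {word for word, _ in counts.most_common(top_n)}

def remove_stopwords(tokens, stopwords):
    """Filter tokens; this can run at indexing time or at query time."""
    return [t for t in tokens if t not in stopwords]

docs = ["the cat sat on the hat", "the dog ate the cat and the hat"]
stop = frequency_stopwords(docs, top_n=1)
filtered = remove_stopwords("the cat sat on the hat".split(), stop)
print(stop, filtered)
```

Keeping `remove_stopwords` separate from indexing makes it possible to index every word and defer the filtering decision to query time.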
23
Indexing (Stemming)
Words have many morphological variations:
inflectional (plurals, tenses)
derivational (e.g. making verbs into nouns)
In most cases, these have the same or very similar meanings.
Stemmers attempt to reduce morphological variations of words to a common stem;
this usually involves removing suffixes.
Stemming can be done at indexing time or as part of query processing (like stopwords).
24
Indexing (Stemming)
Porter Stemmer
Algorithmic stemmer used in IR experiments since the 70s
Consists of a series of rules designed to strip the longest possible suffix at each step
Produces stems, not words
Example: Step 1 (right figure)
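The figure is not reproduced in this text, so the sketch below uses the step 1a rules from Porter's original 1980 description (sses → ss, ies → i, ss → ss, s → nothing), which may differ in detail from the figure. It shows the "longest suffix first" idea and why the output is a stem rather than a word.

```python
def porter_step_1a(word):
    """Apply Porter step 1a, trying the longest suffix first."""
    if word.endswith("sses"):
        return word[:-2]   # sses -> ss   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]   # ies  -> i    (ponies -> poni)
    if word.endswith("ss"):
        return word        # ss   -> ss   (caress -> caress)
    if word.endswith("s"):
        return word[:-1]   # s    -> ""   (cats -> cat)
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

Note that "poni" is not an English word; a stem only needs to be shared by the variants it groups together.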
25
Indexing (Phrases)
Recall: more meaningful tokens, e.g. phrases, make better indexes.
Text processing issue: how are phrases recognized?
Three possible approaches:
1. Identify syntactic phrases using a part-of-speech (POS) tagger
2. Use word n-grams
3. Store word positions in indexes and use proximity operators in queries
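Of the three approaches, word n-grams are the simplest to sketch. The helper name `word_ngrams` is illustrative; it slides a window of n tokens over an already tokenized text.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "samsung galaxy s7 edge".split()
bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

N-grams need no linguistic analysis, but they generate many candidate phrases that are not meaningful, so they are usually filtered by frequency.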
26
Indexing (Phrases)
Example Noun Phrases
* Other methods, e.g. n-grams
27
Indexing (Inverted Index)
Recall: indexes are designed to support search.
Each index term is associated with an inverted list.
The list contains documents, or word occurrences in documents, and other information.
Each entry is called a posting.
The part of the posting that refers to a specific document or location is called a pointer.
Each document in the collection is given a unique number.
Lists are usually document-ordered (sorted by document number).
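A minimal inverted index along these lines can be sketched as a dictionary from term to a document-ordered list of document numbers. The two example sentences and the numbering from 1 are assumptions for the illustration.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the sorted list of document numbers containing it."""
    index = defaultdict(list)
    # Documents are numbered 1, 2, ... in collection order, so each
    # inverted list is built already sorted by document number.
    for doc_id, text in enumerate(documents, start=1):
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return dict(index)

docs = ["the cat sat on the hat", "the dog ate the cat and the hat"]
index = build_inverted_index(docs)
print(index["cat"], index["dog"], index["sat"])
```

Looking up a query term is then a single dictionary access instead of a scan over every document.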
28
Indexing (Inverted Index)
Sample collection: 4 sentences from the Wikipedia entry for Tropical Fish
29
Indexing (Inverted Index)
Simple inverted index.
30
Indexing (Inverted Index)
Inverted index with
counts.
Support better
ranking algorithms.
31
Indexing
(Inverted Index)
Inverted index with
positions.
Support proximity
matching.
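An index with positions can be sketched by storing, for each term and document, the list of word positions. Counts then come for free as the length of each list. The function names and the proximity test `within` are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict

def build_positional_index(documents):
    """Map term -> {document number -> list of word positions (from 1)}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents, start=1):
        for pos, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)

def within(index, term_a, term_b, k):
    """Documents where term_a and term_b occur at most k positions apart."""
    postings_a = index.get(term_a, {})
    postings_b = index.get(term_b, {})
    hits = []
    for doc_id in sorted(postings_a.keys() & postings_b.keys()):
        if any(abs(pa - pb) <= k
               for pa in postings_a[doc_id]
               for pb in postings_b[doc_id]):
            hits.append(doc_id)
    return hits

docs = ["the cat sat on the hat", "the dog ate the cat and the hat"]
pos_index = build_positional_index(docs)
print(pos_index["the"])
print(within(pos_index, "cat", "hat", 3))
```

The price of proximity matching is a larger index, since every occurrence is stored rather than one entry per document.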
32
Retrieval Function
Ranking
Documents are retrieved in sorted order according to a score computed using the document representation, the query, and a ranking algorithm.
33
Retrieval Function (Vector Space Model)
Rank-based method.
Documents and the query are each represented by a vector of term weights.
The collection is represented by a matrix of term weights.
34
Retrieval Function (Vector Space Model)
Vector of useful terms
D1: new straits times
D2: new straits daily
D3: north borneo times

      borneo  daily  new  north  straits  times
D1    0       0      1    0      1        1
D2    0       1      1    0      1        0
D3    1       0      0    1      0        1
35
Retrieval Function (Vector Space Model)
tf.idf weight
Term frequency (tf) weight measures importance in the document.
Inverse document frequency (idf) measures importance in the collection: idf(t) = log(N / n_t), where N is the number of documents and n_t is the number of documents containing term t (base-10 logarithm in the values below).
idf(borneo) = log(3/1) = 0.477
idf(daily) = log(3/1) = 0.477
idf(new) = log(3/2) = 0.176
idf(north) = log(3/1) = 0.477
idf(straits) = log(3/2) = 0.176
idf(times) = log(3/2) = 0.176
then multiply by tf:

      borneo  daily  new    north  straits  times
D1    0       0      0.176  0      0.176    0.176
D2    0       0.477  0.176  0      0.176    0
D3    0.477   0      0      0.477  0        0.176

Note: Doc Length, Term Location, Term Semantic Meaning
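The weights in the table can be recomputed with a short script. This is a sketch that assumes raw term counts for tf (each term occurs once per document here, so other tf variants would give the same table) and a base-10 logarithm for idf, which matches the values shown.

```python
import math

docs = {
    "D1": "new straits times",
    "D2": "new straits daily",
    "D3": "north borneo times",
}
# Alphabetical term order: borneo, daily, new, north, straits, times.
vocab = sorted({t for text in docs.values() for t in text.split()})
n_docs = len(docs)

def idf(term):
    """log10(N / n_t), where n_t is the number of documents containing term."""
    n_t = sum(1 for text in docs.values() if term in text.split())
    return math.log10(n_docs / n_t)

def tfidf_vector(text):
    """Raw term count times idf, rounded to three decimals as on the slide."""
    tokens = text.split()
    return [round(tokens.count(t) * idf(t), 3) for t in vocab]

weights = {name: tfidf_vector(text) for name, text in docs.items()}
for name, vector in weights.items():
    print(name, vector)
```

Rare terms such as "borneo" receive a higher weight than terms shared by two of the three documents.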
36
Retrieval Function (Vector Space Model)
Documents are ranked by the distance between the points representing the query and the documents.
A similarity measure is more common than a distance or dissimilarity measure,
e.g. cosine correlation.
37
Retrieval Function (Vector Space Model)
Consider two documents D1, D2 and a query Q
Q = “straits times”
Compare against the collection, e.g. D1 = “new straits times”
Term order: (borneo, daily, new, north, straits, times)
Q = (0, 0, 0, 0, 0.176, 0.176)
D1 = (0, 0, 0.176, 0, 0.176, 0.176)
D2 = (0, 0.477, 0.176, 0, 0.176, 0)
\[
\mathrm{Cosine}(D_1, Q) = \frac{0 \cdot 0 + 0 \cdot 0 + 0.176 \cdot 0 + 0 \cdot 0 + 0.176 \cdot 0.176 + 0.176 \cdot 0.176}{\sqrt{(0.176^2 + 0.176^2 + 0.176^2)\,(0.176^2 + 0.176^2)}} = 0.816
\]
Find Cosine(D2, Q).
Which document is more relevant?
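The cosine computation can be checked with a short function. This is a plain-Python sketch of the standard cosine formula applied to the vectors above; returning 0.0 for an all-zero vector is a convention chosen here, not something the slides specify.

```python
import math

def cosine(x, y):
    """Dot product of x and y divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm_x * norm_y)

# Term order: borneo, daily, new, north, straits, times.
q = [0, 0, 0, 0, 0.176, 0.176]
d1 = [0, 0, 0.176, 0, 0.176, 0.176]
d2 = [0, 0.477, 0.176, 0, 0.176, 0]

print(round(cosine(d1, q), 3))
```

Calling `cosine(d2, q)` in the same way gives the value for the exercise, and comparing the two scores answers the ranking question.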
38
Evaluation
Evaluating the retrieval function, preprocessing steps, etc. is a must.
Standard Collection
Task specific
Human experts are used to judge relevant results.
Performance Metric
Precision
Recall
39
Evaluation (Collection)
Test collections consisting of documents, queries, and relevance
judgments, e.g.,
40
Evaluation (Collection)
Example query and narrative for gold standard.
41
Evaluation (Effectiveness Measures)
A is the set of relevant documents,
B is the set of retrieved documents
Recall = |A ∩ B| / |A|
Precision = |A ∩ B| / |B|
42
Evaluation (Ranking Effectiveness)
43
Evaluation (Ranking Effectiveness)
Recall@4 = 3/4
Precision@4 = 3/4
Recall@2 = 2/4
Precision@2 = 2/2
44
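The figures above can be reproduced with a small helper. The ranking and the relevant set below are hypothetical, chosen only so that four documents are relevant, three of them appear in the top 4, and two in the top 2; the original ranking from the slide's figure is not available in this text.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Return (precision@k, recall@k) for a ranked list of document ids."""
    top_k = ranked[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / k, hits / len(relevant)

# Hypothetical data: d1, d2, d4 and d7 are the relevant documents.
relevant = {"d1", "d2", "d4", "d7"}
ranked = ["d1", "d2", "d3", "d4", "d5", "d6", "d7"]

print(precision_recall_at_k(ranked, relevant, 4))
print(precision_recall_at_k(ranked, relevant, 2))
```

Moving down the ranking, recall can only stay the same or rise, while precision typically falls as non-relevant documents are included.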
Challenges
Social texts, e.g. tweets, posts
Hard question. Hard disk?
Named entity
Various levels and aspects of annotations
45
Challenges
Small Data
Specific search
Improve semantics extensively
Big Data
Multimodal retrieval
Connecting many media
46
Case: Adding Semantics Bibliography
Improve Search Results Display
Facet-based
semantic
Useful Terms
Demo: ir.cs.usm.my
THANK YOU
khgan@usm.my
49

Concepts and Challenges of Text Retrieval for Search Engine

  • 1. CONCEPTS AND CHALLENGES OFTEXT RETRIEVAL FOR SEARCH ENGINES PRE CONFERENCETUTORIAL by Gan Keng Hoon 16th August 2016 1
  • 2. THISTUTORIAL Overview:Text Retrieval & Search Engine Concept : Basics ofText Retrieval Challenges: Semantics & Specific Case: Expert Search Engine 2
  • 4. What Do People Search for? FuYuanhui How to get free Pokeball ? How to write thesis in three month ? keynote speaker ICAICTA 2016 4
  • 5. What Do People Expect ? How to get free Pokeball 5
  • 7. Quiz:Which one is not a Search Engine? 7
  • 8. Type of Search Engine Web Search Engine Google,Yahoo, Bing Domain Specific Search Engine Medline/Pubmed Microsoft Academic Desktop Search Engine Copernic 8
  • 9. ConnectingTwo Ends Search Collection  Web  Domain Specific  Personal  Enterprise Etc. Information Needs I want to know more about the keynotes speech of ICAICTA 2016. I need more Pokeballs Free Of Charge..… What’s so funny about FuYuan Hui?? Scholarship ending soon, three months left to submit my thesis….  Web Sites  Journal Articles  News  Images  Videos  Audio  Scanned Documents  Tweets  Posts  Reviews  Etc… 9
  • 10. A Conceptual Model forText Retrieval Information Needs Query Search Collection Document Representation Retrieved Documents Indexing Formulation Retrieval Function Relevance Feedback Natural Language Content Analysis 10
  • 12. SearchCollection (Retrieval Unit) Web pages, email, books, news stories, scholarly papers, text messages,Word™, Powerpoint™, PDF, forum postings, patents, etc. Retrieval unit can be Part of document, e.g. a paragraph, a slide, a page etc. In the form different structure, html, xml, text etc. In different sizes/length. 12
  • 13. Document Representation FullText Representation Keep everything. Complete. Require huge resources.Too much may not be good. Reduced (partial) Content Representation Remove not important contents e.g. stopwords. Standardization to reduce overlapped contents e.g. stemming. Retain only important contents, e.g. noun phrases, header etc. 13
  • 14. Document Representation Think of representation as some ways of storing the document. Bag of Words Model Store the words as the bag (multiset) of its words, disregarding grammar and even word order. Document 1: "The cat sat on the hat" Document 2: "The dog ate the cat and the hat" From these two documents, a word list is constructed: { the, cat, sat, on, hat, dog, ate, and } The list has 8 distinct words. Document 1: { 2, 1, 1, 1, 1, 0, 0, 0 } Document 2 : { 3, 1, 0, 0, 1, 1, 1, 1} 14
  • 15. Information Needs & Query Information Needs != Query Recall the information needs Query: icaicta 2016 keynote Information Need: I want to know more about the keynotes speech of ICAICTA 2016 Query: free pokeball Information Need: I need more Pokeballs. I don’t want to pay. No cheat codes. 15
  • 16. Retrieved Documents From the original collection, a subset of documents are obtained. What is the factor that determines what document to return? SimpleTerm Matching Approach 1. Compare the terms in a document and query. 2. Compute “similarity” between each document in the collection and the query based on the terms they have in common. 3. Sorting the document in order of decreasing similarity with the query. 4. The outputs are a ranked list and displayed to the user - the top ones are more relevant as judged by the system. 16
  • 17. Indexing Convert documents into representation or data structure to improve the efficiency of retrieval. To generate a set of useful terms called indexes. Why? Many variety of words used in texts, but not all are important. Among the important words, some are more contextually relevant. Some basic processes involved •Tokenization •StopWords Removal •Stemming •Phrases •Inverted File 17
  • 18. Indexing (Tokenization) Convert a sequence of characters into a sequence of tokens with some basic meaning. “The cat chases the mouse.” “Bigcorp's 2007 bi-annual report showed profits rose 10%.” the cat chases the mouse bigcorp 2007 bi annual report showed profits rose 10% 18
  • 19. Indexing (Tokenization) Token can be single or multiple terms. “Samsung Galaxy S7 Edge, redefines what a phone can do.” samsung galaxy s7 edge redefines what a phone can do samsung galaxy s7 edge redefines what a …. or 19
  • 20. Indexing (Tokenization) Common Issues 1. Capitalized words can have different meaning from lower case words Bush fires the officer. Query: Bush fire The bush fire lasted for 3 days. Query: bush fire 2. Apostrophes can be a part of a word, a part of a possessive, or just a mistake rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's 20
  • 21. Indexing (Tokenization) 3. Numbers can be important, including decimals nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358 4. Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations I.B.M., Ph.D., cs.umass.edu, F.E.A.R. Note: tokenizing steps for queries must be identical to steps for documents 21
  • 22. Indexing (Stopping): Top 50 words from the AP89 news collection. Recall that indexes should be useful terms that link to a document. Are the terms in the figure on the right useful?
  • 23. Indexing (Stopping): A stopword list can be created from high-frequency words or based on a standard list. Lists are customized for applications, domains, and even parts of documents, e.g., “click” is a good stopword for anchor text. Best policy: index all words in documents, and decide which words to use at query time?
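Building a stopword list from high-frequency words can be sketched as below. The helper names `build_stopword_list` and `remove_stopwords`, the cut-off `top_k`, and the two toy documents are assumptions for illustration; a real list would be derived from a full collection or taken from a standard list.

```python
from collections import Counter


def build_stopword_list(token_lists, top_k):
    """Take the top_k most frequent terms in the collection as stopwords."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {term for term, _ in counts.most_common(top_k)}


def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stopword list."""
    return [tok for tok in tokens if tok not in stopwords]


collection = [
    ["the", "cat", "chases", "the", "mouse"],
    ["the", "dog", "chases", "the", "cat"],
]

stopwords = build_stopword_list(collection, top_k=1)
filtered = remove_stopwords(collection[0], stopwords)
print(stopwords, filtered)
```

Applying `remove_stopwords` at query time instead of at indexing time keeps the full index available, which matches the policy raised on the slide.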
  • 24. Indexing (Stemming): Words have many morphological variations: inflectional (plurals, tenses) and derivational (making verbs into nouns, etc.). In most cases, these have the same or very similar meanings. Stemmers attempt to reduce morphological variations of words to a common stem, which usually involves removing suffixes. Stemming can be done at indexing time or as part of query processing (like stopwords).
  • 25. Indexing (Stemming): Porter Stemmer. An algorithmic stemmer used in IR experiments since the 70s. Consists of a series of rules designed to remove the longest possible suffix at each step. Produces stems, not words. Example: Step 1 (right figure).
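To give a feel for rule-based suffix stripping, the sketch below covers only step 1a of the published Porter algorithm (the plural rules), not the full stemmer and not necessarily the exact rules shown in the slide's figure. The function name `porter_step_1a` is hypothetical.

```python
def porter_step_1a(word):
    """Apply Porter step 1a: the first (longest) matching suffix rule wins."""
    rules = [
        ("sses", "ss"),  # caresses -> caress
        ("ies", "i"),    # ponies   -> poni
        ("ss", "ss"),    # caress   -> caress (protects words ending in ss)
        ("s", ""),       # cats     -> cat
    ]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for word in ["caresses", "ponies", "caress", "cats"]:
    print(word, "->", porter_step_1a(word))
```

The result for "ponies" shows why the output is called a stem rather than a word: "poni" is not an English word, but every variant of the word maps to it.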
  • 26. Indexing (Phrases): Recall that meaningful tokens, e.g. phrases, make better indexes. Text processing issue: how are phrases recognized? Three possible approaches: (1) identify syntactic phrases using a part-of-speech (POS) tagger; (2) use word n-grams; (3) store word positions in indexes and use proximity operators in queries.
  • 27. Indexing (Phrases): Example: noun phrases. * Other methods, such as n-grams.
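Of the approaches to phrase recognition, word n-grams are the simplest to sketch, because they need no linguistic analysis. The function name `word_ngrams` is an assumption for illustration.

```python
def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined into candidate phrases."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["samsung", "galaxy", "s7", "edge"]
bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

Every n-gram is treated as a candidate phrase, so most of them are not meaningful; in practice frequency or a POS tagger would be used to filter them.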
  • 28. Indexing (Inverted Index): Recall that indexes are designed to support search. Each index term is associated with an inverted list, which contains lists of documents, or lists of word occurrences in documents, and other information. Each entry is called a posting. The part of the posting that refers to a specific document or location is called a pointer. Each document in the collection is given a unique number. Lists are usually document-ordered (sorted by document number).
  • 29. Indexing (Inverted Index): Sample collection: 4 sentences from the Wikipedia entry for Tropical Fish.
  • 30. Indexing (Inverted Index): Simple inverted index.
  • 31. Indexing (Inverted Index): Inverted index with counts. Supports better ranking algorithms.
  • 32. Indexing (Inverted Index): Inverted index with positions. Supports proximity matching.
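The three index variants can be derived from one structure. The sketch below builds a positional inverted index and shows how counts and proximity matching follow from it. It uses the three-document toy collection from the vector space model slides rather than the Wikipedia sentences, and the names `build_positional_index` and `adjacent_docs` are hypothetical.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to a document-ordered list of (doc_id, positions) postings."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # document-ordered inverted lists
        positions = defaultdict(list)
        for pos, term in enumerate(docs[doc_id].lower().split(), start=1):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))
    return dict(index)


def adjacent_docs(index, first, second):
    """Proximity match: documents where `second` directly follows `first`."""
    second_postings = dict(index.get(second, []))
    hits = []
    for doc_id, positions in index.get(first, []):
        following = second_postings.get(doc_id, [])
        if any(pos + 1 in following for pos in positions):
            hits.append(doc_id)
    return hits


docs = {1: "new straits times", 2: "new straits daily", 3: "north borneo times"}
index = build_positional_index(docs)

# Simple inverted index: document numbers only.
simple = {term: [doc_id for doc_id, _ in postings] for term, postings in index.items()}
# Inverted index with counts: (document number, term frequency).
counts = {term: [(doc_id, len(p)) for doc_id, p in postings] for term, postings in index.items()}

print(index["times"], simple["new"], counts["straits"])
print(adjacent_docs(index, "straits", "times"))
```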
  • 33. Retrieval Function: Ranking. Documents are retrieved in sorted order according to a score computed using the document representation, the query, and a ranking algorithm.
  • 34. Retrieval Function (Vector Space Model): Rank-based method. Documents and the query are each represented by a vector of term weights. The collection is represented by a matrix of term weights.
  • 35. Retrieval Function (Vector Space Model): D1: new straits times. D2: new straits daily. D3: north borneo times. Vector of useful terms:
          borneo  daily  new  north  straits  times
      D1  0       0      1    0      1        1
      D2  0       1      1    0      1        0
      D3  1       0      0    1      0        1
  • 36. Retrieval Function (Vector Space Model): tf.idf weight. The term frequency (tf) weight measures importance in the document; the inverse document frequency (idf) weight measures importance in the collection. Compute idf (log is base 10), then multiply by tf:
      idf(borneo)  = log(3/1) = 0.477
      idf(daily)   = log(3/1) = 0.477
      idf(new)     = log(3/2) = 0.176
      idf(north)   = log(3/1) = 0.477
      idf(straits) = log(3/2) = 0.176
      idf(times)   = log(3/2) = 0.176
          borneo  daily  new    north  straits  times
      D1  0       0      0.176  0      0.176    0.176
      D2  0       0.477  0.176  0      0.176    0
      D3  0.477   0      0      0.477  0        0.176
      Note: Doc Length, Term Location, Term Semantic Meaning
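The weights in this matrix can be reproduced with a few lines of code. The sketch assumes raw term counts for tf and a base-10 logarithm for idf, which is what the slide's numbers imply; the function names `idf` and `tfidf_vector` are hypothetical.

```python
import math

docs = {
    "D1": "new straits times",
    "D2": "new straits daily",
    "D3": "north borneo times",
}

# Sorted vocabulary gives the column order: borneo, daily, new, north, straits, times.
vocab = sorted({term for text in docs.values() for term in text.split()})


def idf(term, docs):
    """Inverse document frequency: log10(N / number of documents containing term)."""
    df = sum(term in text.split() for text in docs.values())
    return math.log10(len(docs) / df)


def tfidf_vector(text, docs, vocab):
    """Weight each vocabulary term by raw term frequency times idf."""
    tokens = text.split()
    return [tokens.count(term) * idf(term, docs) for term in vocab]


for doc_id, text in docs.items():
    print(doc_id, [round(w, 3) for w in tfidf_vector(text, docs, vocab)])
```

As the slide's note suggests, this weighting ignores document length, term location, and term meaning.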
  • 37. Retrieval Function (Vector Space Model): Documents are ranked by the distance between the points representing the query and the documents. A similarity measure is more common than a distance or dissimilarity measure, e.g. cosine correlation.
  • 38. Retrieval Function (Vector Space Model): Consider two documents D1, D2 and a query Q = “straits times”. Compare Q against the collection, e.g. D1 = “new straits times”. With term order (borneo, daily, new, north, straits, times):
      Q  = (0, 0, 0, 0, 0.176, 0.176)
      D1 = (0, 0, 0.176, 0, 0.176, 0.176)
      D2 = (0, 0.477, 0.176, 0, 0.176, 0)
      $$\mathrm{Cosine}(D_1, Q) = \frac{0 \cdot 0 + 0 \cdot 0 + 0.176 \cdot 0 + 0 \cdot 0 + 0.176 \cdot 0.176 + 0.176 \cdot 0.176}{\sqrt{(0.176^2 + 0.176^2 + 0.176^2)(0.176^2 + 0.176^2)}} = 0.816$$
      Find Cosine(D2, Q). Which document is more relevant?
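The cosine computation can be checked with a short script. The vectors are copied from the slide; the function name `cosine` is an assumption, and the value for D2 is left for the reader to run, since the slide poses it as an exercise.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Term order: borneo, daily, new, north, straits, times
q = [0, 0, 0, 0, 0.176, 0.176]
d1 = [0, 0, 0.176, 0, 0.176, 0.176]
d2 = [0, 0.477, 0.176, 0, 0.176, 0]

print("Cosine(D1, Q) =", round(cosine(d1, q), 3))  # matches the slide: 0.816
print("Cosine(D2, Q) =", round(cosine(d2, q), 3))
```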
  • 39. Evaluation: It is a must to evaluate the retrieval function, preprocessing steps, etc. Standard collection: task specific; human experts are used to judge relevant results. Performance metrics: precision, recall.
  • 40. Evaluation (Collection): Test collections consist of documents, queries, and relevance judgments. [Figure: example test collections]
  • 41. Evaluation (Collection): Example query and narrative for a gold standard.
  • 42. Evaluation (Effectiveness Measures): A is the set of relevant documents, B is the set of retrieved documents. Recall = |A ∩ B| / |A|; Precision = |A ∩ B| / |B|.
  • 44. Evaluation (Ranking Effectiveness): Recall@4 = 3/4, Precision@4 = 3/4, Recall@2 = 2/4, Precision@2 = 2/2.
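Precision and recall at a rank cut-off can be computed from a list of relevance judgments. The ranking below is hypothetical: it is one ordering that is consistent with the four values on the slide, assuming 4 relevant documents exist in the collection. The function names `precision_at_k` and `recall_at_k` are also assumptions.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(relevance[:k]) / k


def recall_at_k(relevance, k, total_relevant):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(relevance[:k]) / total_relevant


# Hypothetical ranked list: 1 = relevant, 0 = not relevant, best rank first.
ranked = [1, 1, 0, 1, 0, 0]
total_relevant = 4  # assumed number of relevant documents in the collection

for k in (2, 4):
    print(k, precision_at_k(ranked, k), recall_at_k(ranked, k, total_relevant))
```

Moving the cut-off from 2 to 4 raises recall but lowers precision, which is the usual trade-off between the two measures.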
  • 45. Challenges: Social texts, e.g. tweets, posts. Hard question. Hard Disk? Named entities. Various levels and aspects of annotations.
  • 46. Challenges: Small Data: specific search; improve semantics extensively. Big Data: multi-modal retrieval; connecting many media.
  • 47. Case: Adding Semantics Bibliography
  • 48. Improve Search Results Display: Facet-based semantic. Useful Terms. Demo: ir.cs.usm.my