CSE509: Introduction to Web Science and Technology
Lecture 2: Basic Information Retrieval Models
Muhammad Atif Qureshi and Arjumand Younus
Web Science Research Group, Institute of Business Administration (IBA)

Last Time…
  • What is Web Science?
  • Why We Need Web Science?
  • Implications of Web Science
  • Web Science Case-Study: Diff-IE
July 16, 2011

Today
  • Basic Information Retrieval Approaches
  • Bag of Words Assumption
  • Information Retrieval Models
      • Boolean model
      • Vector-space model
      • Topic/Language models

Computer-based Information Management
  • Basic problem
      • How to use computers to help humans store, organize and retrieve information?
      • What approaches have been taken and what has been successful?

Three Major Approaches
  • Database approach
  • Expert-system approach
  • Information retrieval approach

Database Approach
  • Information is stored in a highly-structured way
      • Data stored in relational tables as tuples
  • Simple data model and query language
      • Relational model and SQL query language
  • Clear interpretation of data and query
  • No ambition to be “intelligent” like humans
  • Main focus on system efficiency
  • Pros and cons?

Expert-System Approach
  • Information stored as a set of logical predicates
      • Bird(X).Cat(X).Fly(X)
  • Given a query, the system infers the answer through logical inference
      • Bird(Ostrich) → Fly(Ostrich)?
  • Pros and cons?

Information Retrieval Approach
  • Uses existing text documents as information source
      • No special structuring or database construction required
  • Text-based query language
      • Keyword-based query or natural language query
  • Pros and cons?

Database vs. Information Retrieval
  • What we’re retrieving
      • IR: Mostly unstructured. Free text with some metadata.
      • Databases: Structured data. Clear semantics based on a formal model.
  • Queries we’re posing
      • IR: Vague, imprecise information needs (often expressed in natural language).
      • Databases: Formally (mathematically) defined queries. Unambiguous.
  • Results we get
      • IR: Sometimes relevant, often not.
      • Databases: Exact. Always correct in a formal sense.
  • Interaction with system
      • IR: Interaction is important.
      • Databases: One-shot queries.
  • Other issues
      • IR: Issues downplayed.
      • Databases: Concurrency, recovery, atomicity are all critical.

Main Challenge of Information Retrieval Approach
  • Interpretation of query and data is not straightforward
      • Both queries and data are fuzzy
      • Unstructured text and natural language query

Need for an Information Retrieval Model
  • Computers do not understand the document or the query
  • Implementing the information retrieval approach therefore requires a model that a computer can execute

What is a Model?
  • A model is a construct designed to help us understand a complex system
      • A particular way of “looking at things”
  • Models work by adopting simplifying assumptions
  • Different types of models
      • Conceptual model
      • Physical analog model
      • Mathematical models
      • …

Major Simplification: Bag of Words
  • Consider each document as a “bag of words”
  • Consider queries as a “bag of words” as well
  • Great oversimplification, but works adequately in many cases
  • Still, how do we match documents and queries?

Sample Document

McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.
NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.
But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.
But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.
Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.
…

“Bag of Words”
  16 × said
  14 × McDonalds
  12 × fat
  11 × fries
  8 × new
  6 × company, french, nutrition
  5 × food, oil, percent, reduce, taste, Tuesday
  …
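
The counting step behind a bag of words can be sketched in a few lines of Python. This is only an illustration: the function name `bag_of_words` and the letters-only tokenizer are assumptions made here, not something the lecture specifies, and a real system would choose its own tokenization rules.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into runs of letters, and count each token.

    The tokenizer is a simplifying assumption for this sketch.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

headline = ("Fast-food chain to reduce certain types of fat "
            "in its french fries with new cooking oil.")
bag = bag_of_words(headline)

# Word order is discarded: only the counts survive.
print(bag["fat"], bag["fries"], sum(bag.values()))
```

Because only counts are kept, two texts with the same words in a different order produce the same bag.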

Vector Representation
  • Computational efficiency
  • Ease of manipulation
  • A vector is a set of values recorded in any consistent order

“The quick brown fox jumped over the lazy dog’s back”
[ 1 1 1 1 1 1 1 1 2 ]
  1st position corresponds to “back”
  2nd position corresponds to “brown”
  3rd position corresponds to “dog”
  4th position corresponds to “fox”
  5th position corresponds to “jump”
  6th position corresponds to “lazy”
  7th position corresponds to “over”
  8th position corresponds to “quick”
  9th position corresponds to “the”
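
As a rough sketch, the vector for the example sentence can be built by counting terms against a fixed, ordered vocabulary. The `BASE_FORMS` lookup is a hypothetical stand-in for whatever stemming the slide applied (it maps “jumped” to “jump” and “dog’s” to “dog”); the function name is likewise an assumption.

```python
import re
from collections import Counter

# Hypothetical normalization table standing in for a real stemmer.
BASE_FORMS = {"jumped": "jump", "dog's": "dog"}

def count_vector(text, vocabulary):
    """Return term counts in the order given by `vocabulary`."""
    # Normalize the typographic apostrophe so "dog’s" matches the lookup.
    tokens = re.findall(r"[a-z']+", text.lower().replace("’", "'"))
    counts = Counter(BASE_FORMS.get(token, token) for token in tokens)
    return [counts[term] for term in vocabulary]

sentence = "The quick brown fox jumped over the lazy dog’s back"
vocabulary = ["back", "brown", "dog", "fox", "jump",
              "lazy", "over", "quick", "the"]
vector = count_vector(sentence, vocabulary)
print(vector)  # [1, 1, 1, 1, 1, 1, 1, 1, 2]
```

The only requirement is that every document uses the same vocabulary order, so that position k always refers to the same term.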

Representing Documents

Document 1: The quick brown fox jumped over the lazy dog’s back.
Document 2: Now is the time for all good men to come to the aid of their party.
Stopword List: for, is, of, the, to

Term     Document 1   Document 2
aid          0            1
all          0            1
back         1            0
brown        1            0
come         0            1
dog          1            0
fox          1            0
good         0            1
jump         1            0
lazy         1            0
men          0            1
now          0            1
over         1            0
party        0            1
quick        1            0
their        0            1
time         0            1
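
The term-by-document table can be reproduced with a short sketch: remove the stopwords, reduce each document to its set of terms, and mark presence with 1 and absence with 0. The `BASE_FORMS` lookup is again a hypothetical substitute for stemming, and the helper name `terms` is an assumption.

```python
import re

STOPWORDS = {"for", "is", "of", "the", "to"}          # stopword list from the slide
BASE_FORMS = {"jumped": "jump", "dog's": "dog"}       # hypothetical stemming stand-in

def terms(text):
    """Return the set of normalized, non-stopword terms in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower().replace("’", "'"))
    return {BASE_FORMS.get(token, token) for token in tokens} - STOPWORDS

doc1 = "The quick brown fox jumped over the lazy dog’s back."
doc2 = "Now is the time for all good men to come to the aid of their party."
t1, t2 = terms(doc1), terms(doc2)

vocabulary = sorted(t1 | t2)
for term in vocabulary:
    # One row per term: presence (1) or absence (0) in each document.
    print(f"{term:<6} {int(term in t1)} {int(term in t2)}")
```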

Boolean Model
  • Return all documents that contain the words in the query
  • Weights assigned to terms are either “0” or “1”
      • “0” represents “absence”: term isn’t in the document
      • “1” represents “presence”: term is in the document
  • No notion of “ranking”
      • A document is either a match or a non-match
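
A minimal sketch of Boolean matching, assuming a conjunctive (AND) reading of “contain the words in the query”: each document is a set of terms, and a document matches only if every query term is present. The function name `boolean_and` and the two-document index are illustrative assumptions.

```python
def boolean_and(query_terms, index):
    """Return the ids of documents that contain every query term."""
    wanted = set(query_terms)
    return sorted(doc_id for doc_id, doc_terms in index.items()
                  if wanted <= doc_terms)

# Term sets for the two example documents (stopwords removed).
index = {
    1: {"back", "brown", "dog", "fox", "jump", "lazy", "over", "quick"},
    2: {"aid", "all", "come", "good", "men", "now", "party", "their", "time"},
}

print(boolean_and(["quick", "fox"], index))  # [1]
print(boolean_and(["fox", "party"], index))  # []
```

The result is an unordered match set: there is no score that would let one matching document be preferred over another.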

Evaluation: Precision and Recall
  • Q: Are all matching documents what users want?
  • Basic idea: a model is good if it returns a document if and only if it is relevant
  • R: set of relevant documents
  • D: set of documents returned by a model
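
With R and D as defined above, the standard definitions are precision = |R ∩ D| / |D| and recall = |R ∩ D| / |R|. The sketch below computes both; the document ids are made up for illustration, and returning 0.0 for an empty set is a convention chosen here.

```python
def precision_recall(relevant, returned):
    """Compute (precision, recall) for a set of returned documents."""
    hits = len(relevant & returned)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

R = {"d1", "d2", "d3", "d4"}   # relevant documents (hypothetical ids)
D = {"d2", "d4", "d7"}         # documents returned by the model

precision, recall = precision_recall(R, D)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three returned documents are relevant, and two of the four relevant documents were returned.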

Ranked Retrieval
  • Order documents by how likely they are to be relevant to the information need
  • Arranging documents by relevance is
      • Closer to how humans think: some documents are “better” than others
      • Closer to user behavior: users can decide when to stop reading
  • Best (partial) match: documents need not have all query terms
      • Although documents with more query terms should be “better”

Similarity-Based Queries
  • Let’s replace relevance with “similarity”
      • Rank documents by their similarity with the query
  • Treat the query as if it were a document
      • Create a query bag-of-words
  • Find its similarity to each document
  • Rank order the documents by similarity
  • Surprisingly, this works pretty well!

Vector Space Model
  • Documents “close together” in vector space “talk about” the same things
  • Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)
[Figure: document vectors d1 to d5 plotted against term axes t1, t2, t3, with angles θ and φ marked]

Similarity Metric
  • Angle between the vectors
[Figure labels: Query Vector, Document Vector]
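
The angle between two vectors is usually measured through its cosine: the dot product divided by the product of the vector lengths. The sketch below is one plain-Python way to compute it; the example vectors are invented, and returning 0.0 for a zero vector is a convention assumed here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch: nothing is similar to an empty vector
    return dot / (norm_u * norm_v)

query = [1, 1, 0]
doc_a = [2, 2, 0]   # same direction as the query, different length
doc_b = [0, 0, 3]   # shares no terms with the query

print(cosine_similarity(query, doc_a))
print(cosine_similarity(query, doc_b))
```

A document pointing in the same direction as the query scores close to 1 regardless of its length, while a document with no terms in common scores 0.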

How do we Weight Document Terms?
  • Terms that appear often in a document should get high weights
  • Terms that appear in many documents should get low weights
  • How do we capture this mathematically?
      • Term frequency
      • Inverse document frequency
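
One common way to combine the two quantities is to multiply the term frequency by the logarithm of the inverse document frequency. The form below reproduces the idf column of the TF.IDF example (for instance, a term found in 2 of 4 documents gets log10(4/2) ≈ 0.301). The symbols are names assumed here: tf_{i,j} is the count of term i in document j, N is the number of documents, and n_i is the number of documents containing term i.

```latex
w_{i,j} = \mathrm{tf}_{i,j} \cdot \log_{10}\frac{N}{n_i}
```

A term that occurs in every document has N / n_i = 1 and therefore weight 0, however often it appears.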

TF.IDF Example

               tf (document)   idf      Wi,j (document)
Term           1  2  3  4               1     2     3     4
complicated    -  -  5  2      0.301    -     -     1.51  0.60
contaminated   4  1  3  -      0.125    0.50  0.13  0.38  -
fallout        5  -  4  3      0.125    0.63  -     0.50  0.38
information    6  3  3  2      0.000    -     -     -     -
interesting    -  1  -  -      0.602    -     0.60  -     -
nuclear        3  -  7  -      0.301    0.90  -     2.11  -
retrieval      -  6  1  4      0.125    -     0.75  0.13  0.50
siberia        2  -  -  -      0.602    1.20  -     -     -
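
The idf and weight columns of this example can be recomputed with a short sketch, assuming idf = log10(N / document frequency) with N = 4, which matches the values shown. Each term frequency is assigned to the document position implied by the example; the variable and function names are assumptions.

```python
import math

# Raw term frequencies for documents 1-4, as read from the example.
tf = {
    "complicated":  [0, 0, 5, 2],
    "contaminated": [4, 1, 3, 0],
    "fallout":      [5, 0, 4, 3],
    "information":  [6, 3, 3, 2],
    "interesting":  [0, 1, 0, 0],
    "nuclear":      [3, 0, 7, 0],
    "retrieval":    [0, 6, 1, 4],
    "siberia":      [2, 0, 0, 0],
}
N = 4  # number of documents in the collection

def idf(counts):
    """log10 of (collection size / number of documents containing the term)."""
    document_frequency = sum(1 for count in counts if count > 0)
    return math.log10(N / document_frequency)

# tf.idf weight of every term in every document.
weights = {term: [count * idf(counts) for count in counts]
           for term, counts in tf.items()}

for term, row in weights.items():
    cells = "  ".join(f"{w:.2f}" for w in row)
    print(f"{term:<13} idf={idf(tf[term]):.3f}  {cells}")
```

Note how “information”, which occurs in all four documents, ends up with weight 0 everywhere despite its high counts.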

Normalizing Document Vectors
  • Recall our similarity function: the angle between the query vector and the document vector
  • Normalize document vectors in advance
  • Use the “cosine normalization” method: divide each term weight by the length of the vector

Normalization Example

               tf (document)   idf      Wi,j (document)           W'i,j (document)
Term           1  2  3  4               1     2     3     4       1     2     3     4
complicated    -  -  5  2      0.301    -     -     1.51  0.60    -     -     0.57  0.69
contaminated   4  1  3  -      0.125    0.50  0.13  0.38  -       0.29  0.13  0.14  -
fallout        5  -  4  3      0.125    0.63  -     0.50  0.38    0.37  -     0.19  0.44
information    6  3  3  2      0.000    -     -     -     -       -     -     -     -
interesting    -  1  -  -      0.602    -     0.60  -     -       -     0.62  -     -
nuclear        3  -  7  -      0.301    0.90  -     2.11  -       0.53  -     0.79  -
retrieval      -  6  1  4      0.125    -     0.75  0.13  0.50    -     0.77  0.05  0.57
siberia        2  -  -  -      0.602    1.20  -     -     -       0.71  -     -     -
Length                                  1.70  0.97  2.67  0.87
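
The Length row and the normalized weights can be checked with a small sketch: the length of a document vector is the square root of the sum of its squared weights, and each weight is then divided by that length. The sparse dictionaries below hold the rounded tf.idf weights from the example; the names `length` and `normalize` are assumptions.

```python
import math

# Rounded tf.idf weights per document (zero entries omitted).
W = {
    1: {"contaminated": 0.50, "fallout": 0.63, "nuclear": 0.90, "siberia": 1.20},
    2: {"contaminated": 0.13, "interesting": 0.60, "retrieval": 0.75},
    3: {"complicated": 1.51, "contaminated": 0.38, "fallout": 0.50,
        "nuclear": 2.11, "retrieval": 0.13},
    4: {"complicated": 0.60, "fallout": 0.38, "retrieval": 0.50},
}

def length(vector):
    """Euclidean length of a sparse term-weight vector."""
    return math.sqrt(sum(w * w for w in vector.values()))

def normalize(vector):
    """Divide every weight by the vector length (cosine normalization)."""
    norm = length(vector)
    return {term: w / norm for term, w in vector.items()}

for doc_id, vector in W.items():
    unit = {term: round(w, 2) for term, w in normalize(vector).items()}
    print(doc_id, f"{length(vector):.2f}", unit)
```

After normalization every document vector has length 1, so the dot product with a query depends only on direction, not on document size.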

Retrieval Example

Query: contaminated retrieval

                  W'i,j (document)          W'i,j
Term              1     2     3     4       query
complicated       -     -     0.57  0.69    -
contaminated      0.29  0.13  0.14  -       1
fallout           0.37  -     0.19  0.44    -
information       -     -     -     -       -
interesting       -     0.62  -     -       -
nuclear           0.53  -     0.79  -       -
retrieval         -     0.77  0.05  0.57    1
siberia           0.71  -     -     -       -
similarity score  0.29  0.9   0.19  0.57

Ranked list: Doc 2, Doc 4, Doc 1, Doc 3
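
The similarity scores and the ranked list can be reproduced as a dot product between the query weights and each normalized document vector. This is a sketch under the example's own setup, where both query terms carry weight 1; the function and variable names are assumptions.

```python
# Normalized document weights from the example (zero entries omitted).
W_norm = {
    1: {"contaminated": 0.29, "fallout": 0.37, "nuclear": 0.53, "siberia": 0.71},
    2: {"contaminated": 0.13, "interesting": 0.62, "retrieval": 0.77},
    3: {"complicated": 0.57, "contaminated": 0.14, "fallout": 0.19,
        "nuclear": 0.79, "retrieval": 0.05},
    4: {"complicated": 0.69, "fallout": 0.44, "retrieval": 0.57},
}
query = {"contaminated": 1, "retrieval": 1}

def score(query, document):
    """Dot product of the query weights with a sparse document vector."""
    return sum(weight * document.get(term, 0.0) for term, weight in query.items())

scores = {doc_id: score(query, doc) for doc_id, doc in W_norm.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

print({doc_id: round(s, 2) for doc_id, s in scores.items()})
print(ranking)  # [2, 4, 1, 3]
```

Document 2 ranks first because it is the only one with a high weight for “retrieval” that also contains “contaminated”.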

CSE509 Lecture 2: Basic Information Retrieval Models

  • 1. CSE509: Introduction to Web Science and Technology Lecture 2: Basic Information Retrieval Models Muhammad AtifQureshi and ArjumandYounus Web Science Research Group Institute of Business Administration (IBA)
  • 2. Last Time… What is Web Science? Why We Need Web Science? Implications of Web Science Web Science Case-Study: Diff-IE July 16, 2011
  • 3. Today Basic Information Retrieval Approaches Bag of Words Assumption Information Retrieval Models Boolean model Vector-space model Topic/Language models July 16, 2011
  • 4. Computer-based Information Management Basic problem How to use computers to help humans store, organize and retrieve information? What approaches have been taken and what has been successful? July 16, 2011
  • 5. Three Major Approaches Database approach Expert-system approach Information retrieval approach July 16, 2011
  • 6. Database Approach Information is stored in a highly-structured way Data stored in relational tables as tuples Simple data model and query language Relational model and SQL query language Clear interpretation of data and query No ambition to be “intelligent” like humans Main focus on system efficiency Pros and cons? July 16, 2011
  • 7. Expert-System Approach Information stored as a set of logical predicates Bird(X).Cat(X).Fly(X) Given a query, the system infers the answer through logical inference Bird(Ostrich)  Fly(Ostrich)? Pros and cons? July 16, 2011
  • 8. Information Retrieval Approach Uses existing text documents as information source No special structuring or database construction required Text-based query language Keyword-based query or natural language query Pros and cons? July 16, 2011
  • 9. Database vs. Information Retrieval July 16, 2011 IR Databases What we’re retrieving Mostly unstructured. Free text with some metadata. Structured data. Clear semantics based on a formal model. Queries we’re posing Vague, imprecise information needs (often expressed in natural language). Formally (mathematically) defined queries. Unambiguous. Results we get Sometimes relevant, often not. Exact. Always correct in a formal sense. Interaction with system Interaction is important. One-shot queries. Other issues Issues downplayed. Concurrency, recovery, atomicity are all critical.
  • 10. Main Challenge of Information Retrieval Approach Interpretation of query and data is not straightforward Both queries and data are fuzzy Unstructured text and natural language query July 16, 2011
  • 11. Need forInformation Retrieval Model Computers do not understand the document or the query For implementation of information retrieval approach a computerizable model is essential July 16, 2011
  • 12. What is a Model? A model is a construct designed to help us understand a complex system A particular way of “looking at things” Models work by adopting simplifying assumptions Different types of models Conceptual model Physical analog model Mathematical models … July 16, 2011
  • 13. Major Simplification: Bag of Words Consider each document as a “bag of words” Consider queries as a “bag of words” as well Great oversimplification, but works adequately in many cases Still how do we match documents and queries? July 16, 2011
  • 14. Sample Document July 16, 2011 16 × said 14 × McDonalds 12 × fat 11 × fries 8 × new 6 × company french nutrition 5 × food oil percent reduce taste Tuesday … McDonald's slims down spuds Fast-food chain to reduce certain types of fat in its french fries with new cooking oil. NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. … “Bag of Words”
  • 15.
  • 18. A vector is a set of values recorded in any consistent orderJuly 16, 2011 “The quick brown fox jumped over the lazy dog’s back”  [ 1 1 1 1 1 1 1 1 2 ] 1st position corresponds to “back” 2nd position corresponds to “brown” 3rd position corresponds to “dog” 4th position corresponds to “fox” 5th position corresponds to “jump” 6th position corresponds to “lazy” 7th position corresponds to “over” 8th position corresponds to “quick” 9th position corresponds to “the”
  • 19. Representing Documents July 16, 2011 aid 0 1 all 0 1 back 1 0 brown 1 0 come 0 1 dog 1 0 fox 1 0 good 0 1 jump 1 0 lazy 1 0 men 0 1 now 0 1 over 1 0 party 0 1 quick 1 0 their 0 1 time 0 1 Document 1 Term Document 1 Document 2 The quick brown fox jumped over the lazy dog’s back. Stopword List for is of Document 2 the to Now is the time for all good men to come to the aid of their party.
  • 20. Boolean Model Return all documents that contain the words in the query Weights assigned to terms are either “0” or “1” “0” represents “absence”: term isn’t in the document “1” represents “presence”: term is in the document No notion of “ranking” A document is either a match or a non-match July 16, 2011
  • 21. Evaluation: Precision and Recall Q: Are all matching documents what users want? Basic idea: a model is good if it returns a document if and only if it is relevant R: set of relevant documents D: set of documents returned by a model July 16, 2011
  • 22. Ranked Retrieval Order documents by how likely they are to be relevant to the information need Arranging documents by relevance is Closer to how humans think: some documents are “better” than others Closer to user behavior: users can decide when to stop reading Best (partial) match: documents need not have all query terms Although documents with more query terms should be “better” July 16, 2011
  • 23. Similarity-Based Queries Let’s replace relevance with “similarity” Rank documents by their similarity with the query Treat the query as if it were a document Create a query bag-of-words Find its similarity to each document Rank order the documents by similarity Surprisingly, this works pretty well! July 16, 2011
  • 24. Vector Space Model: Documents “close together” in vector space “talk about” the same things. Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”). [Figure: document vectors d1 to d5 plotted against term axes t1, t2, t3, with angles θ and φ between vectors.]
  • 25. Similarity Metric: Angle between the vectors. [Figure: a query vector and a document vector.]
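A common way to turn “angle between the vectors” into a number is the cosine of that angle: the dot product divided by the product of the vector lengths. The sketch below assumes plain Python lists of equal length; `cosine_similarity` is an illustrative name, and returning 0.0 for a zero vector is a convention chosen here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1, 2], [2, 4])  # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])      # no shared terms
```

A smaller angle gives a cosine closer to 1, so ranking by descending cosine ranks by increasing angle.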
  • 27. Terms that appear often in a document should get high weights
  • 28. Terms that appear in many documents should get low weights
  • 29. How do we capture this mathematically?
  • 32. TF.IDF Example (columns 1 to 4 are documents; “-” marks a blank, i.e. zero, entry)
      Term           tf: 1  2  3  4    idf      Wi,j: 1     2     3     4
      complicated        -  -  5  2    0.301          -     -     1.51  0.60
      contaminated       4  1  3  -    0.125          0.50  0.13  0.38  -
      fallout            5  -  4  3    0.125          0.63  -     0.50  0.38
      information        6  3  3  2    0.000          -     -     -     -
      interesting        -  1  -  -    0.602          -     0.60  -     -
      nuclear            3  -  7  -    0.301          0.90  -     2.11  -
      retrieval          -  6  1  4    0.125          -     0.75  0.13  0.50
      siberia            2  -  -  -    0.602          1.20  -     -     -
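The idf values in this example are consistent with idf = log10(N / df), where N = 4 documents and df is the number of documents containing the term, and with weights W = tf × idf. The sketch below recomputes the table under that assumption. The slide appears to round idf to three decimals before multiplying, so the last digit of some weights can differ slightly.

```python
import math

N = 4  # number of documents in the example

# Term frequencies per document (documents 1 to 4), read off the slide.
tf = {
    "complicated":  [0, 0, 5, 2],
    "contaminated": [4, 1, 3, 0],
    "fallout":      [5, 0, 4, 3],
    "information":  [6, 3, 3, 2],
    "interesting":  [0, 1, 0, 0],
    "nuclear":      [3, 0, 7, 0],
    "retrieval":    [0, 6, 1, 4],
    "siberia":      [2, 0, 0, 0],
}

def idf(counts):
    """Inverse document frequency: log10(N / number of docs containing the term)."""
    df = sum(1 for c in counts if c > 0)
    return math.log10(N / df)

# TF.IDF weight for every term in every document.
weights = {term: [c * idf(counts) for c in counts] for term, counts in tf.items()}
```

Note how “information”, which occurs in all four documents, gets idf 0 and therefore contributes nothing, while “siberia”, which occurs in only one, gets the highest idf.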
  • 33. Normalizing Document Vectors: Recall our similarity function (the cosine of the angle between the query and document vectors). Normalize document vectors in advance, using the “cosine normalization” method: divide each term weight by the length of the vector.
  • 34. Normalization Example (columns 1 to 4 are documents; “-” marks a blank, i.e. zero, entry)
      Term           tf: 1  2  3  4    idf      Wi,j: 1     2     3     4       W'i,j: 1     2     3     4
      complicated        -  -  5  2    0.301          -     -     1.51  0.60           -     -     0.57  0.69
      contaminated       4  1  3  -    0.125          0.50  0.13  0.38  -              0.29  0.13  0.14  -
      fallout            5  -  4  3    0.125          0.63  -     0.50  0.38           0.37  -     0.19  0.44
      information        6  3  3  2    0.000          -     -     -     -              -     -     -     -
      interesting        -  1  -  -    0.602          -     0.60  -     -              -     0.62  -     -
      nuclear            3  -  7  -    0.301          0.90  -     2.11  -              0.53  -     0.79  -
      retrieval          -  6  1  4    0.125          -     0.75  0.13  0.50           -     0.77  0.05  0.57
      siberia            2  -  -  -    0.602          1.20  -     -     -              0.71  -     -     -
      Length                                          1.70  0.97  2.67  0.87
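Cosine normalization for a single document can be sketched as follows, using the Wi,j weights of document 1 from the slide. The function name `cosine_normalize` is an assumption made for this example.

```python
import math

def cosine_normalize(weights):
    """Divide each term weight by the Euclidean length of the vector."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    return length, {term: w / length for term, w in weights.items()}

# Non-zero TF.IDF weights of document 1, as given on the slide.
doc1 = {"contaminated": 0.50, "fallout": 0.63, "nuclear": 0.90, "siberia": 1.20}

length, normalized = cosine_normalize(doc1)
# After normalization the vector has unit length.
unit_check = math.sqrt(sum(w * w for w in normalized.values()))
```

Because every normalized document vector has length 1, the cosine with a query reduces to a dot product at query time.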
  • 35. Retrieval Example
      Query: contaminated retrieval
      Term               query    W'i,j: 1     2     3     4
      complicated        -               -     -     0.57  0.69
      contaminated       1               0.29  0.13  0.14  -
      fallout            -               0.37  -     0.19  0.44
      information        -               -     -     -     -
      interesting        -               -     0.62  -     -
      nuclear            -               0.53  -     0.79  -
      retrieval          1               -     0.77  0.05  0.57
      siberia            -               0.71  -     -     -
      Similarity score                   0.29  0.90  0.19  0.57
      Ranked list: Doc 2, Doc 4, Doc 1, Doc 3
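The similarity scores in this example are dot products between the query vector (weight 1 for each query term) and the normalized document vectors. The sketch below reproduces the ranking from the W'i,j values on the slide. The query vector is left unnormalized, as the slide's scores suggest; dividing by its length would scale every score equally and leave the ranking unchanged.

```python
# Normalized weights W'i,j per document, non-zero entries only (from the slide).
docs = {
    1: {"contaminated": 0.29, "fallout": 0.37, "nuclear": 0.53, "siberia": 0.71},
    2: {"contaminated": 0.13, "interesting": 0.62, "retrieval": 0.77},
    3: {"complicated": 0.57, "contaminated": 0.14, "fallout": 0.19,
        "nuclear": 0.79, "retrieval": 0.05},
    4: {"complicated": 0.69, "fallout": 0.44, "retrieval": 0.57},
}

query = {"contaminated": 1, "retrieval": 1}

def score(query, doc):
    """Dot product of the query vector with one document vector."""
    return sum(weight * doc.get(term, 0.0) for term, weight in query.items())

scores = {doc_id: score(query, doc) for doc_id, doc in docs.items()}
# Highest score first.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Document 1 is still retrieved even though it contains only one of the two query terms, which is the partial-match behavior the Boolean model lacks.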
  • 36. Language Models: Based on the notion of probabilities and processes for generating text. Documents are ranked based on the probability that they generated the query. Best/partial match.
  • 37. What is a Language Model? A probability distribution over strings of text: how likely is a string in a given “language”? Probabilities depend on what language we’re modeling. p1 = P(“a quick brown dog”), p2 = P(“dog quick a brown”), p3 = P(“быстрая brown dog”), p4 = P(“быстрая собака”). In a language model for English: p1 > p2 > p3 > p4. In a language model for Russian: p1 < p2 < p3 < p4.
  • 39. Think of all the things that have ever been said or will ever be said, of any length
  • 40. Count how often each one occurs
  • 41. Is understanding the path to enlightenment?
  • 42. Figure out how meaning and thoughts are expressed
  • 43. Build a model based on this
  • 45. Obviously this is not true…
  • 46. But it seems to work well in practice!
  • 47. The probability of a string given a model: the probability of a sequence of words decomposes into a product of the probabilities of individual words
  • 49. An Example
      Model M:
      P(w)    w
      0.2     the
      0.1     a
      0.01    man
      0.01    woman
      0.03    said
      0.02    likes
      …
      the    man    likes   the    woman
      0.2    0.01   0.02    0.2    0.01     (multiply)
      P(“the man likes the woman”|M) = P(the|M) × P(man|M) × P(likes|M) × P(the|M) × P(woman|M) = 0.00000008
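The calculation on this slide can be sketched as a product over per-word probabilities. The model below contains only the six words listed on the slide. Giving unseen words probability 0 is an assumption of this sketch, and `string_probability` is an illustrative name. `math.prod` requires Python 3.8 or later.

```python
import math

# Word probabilities under model M, as listed on the slide (partial model).
model = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

def string_probability(text, model):
    """Probability of a string as the product of its word probabilities."""
    # Assumption: a word missing from the model gets probability 0.
    return math.prod(model.get(word, 0.0) for word in text.split())

p = string_probability("the man likes the woman", model)
```

Because the score is a plain product, any reordering of the same words gets the same probability, which is the independence assumption the earlier slides call “obviously not true” but workable in practice.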

Editor's Notes

  1. Highly successful; all major businesses use an RDB system.
  2. Popular approach in the 1980s, but has not been successful for general IR.
  3. System returns best-matching documents given the query. Had limited appeal until the Web became popular.
  4. System returns best-matching documents given the query. Had limited appeal until the Web became popular.
  5. System returns best-matching documents given the query. Had limited appeal until the Web became popular.
  6. What documents are good matches for a query?
  7. What documents are good matches for a query?
  8. What documents are good matches for a query? How do we represent the complexities of language, keeping in mind that computers don’t “understand” documents or queries? Simple, yet effective approach: “bag of words”. Treat all the words in a document as index terms for that document. Assign a “weight” to each term based on its “importance”. Disregard order, structure, meaning, etc. of the words.
  9. Bag-of-words approach: information retrieval is all (and only) about matching words in documents with words in queries. Obviously, not true… But it works pretty well!
  10. What documents are good matches for a query? How do we represent the complexities of language, keeping in mind that computers don’t “understand” documents or queries? Simple, yet effective approach: “bag of words”. Treat all the words in a document as index terms for that document. Assign a “weight” to each term based on its “importance”. Disregard order, structure, meaning, etc. of the words.
  11. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance. MapReduce + Cloud Computing Debate