CSE509 Lecture 2: Basic Information Retrieval Models
1. CSE509: Introduction to Web Science and Technology Lecture 2: Basic Information Retrieval Models Muhammad Atif Qureshi and Arjumand Younus Web Science Research Group Institute of Business Administration (IBA)
2. Last Time… What is Web Science? Why Do We Need Web Science? Implications of Web Science Web Science Case Study: Diff-IE July 16, 2011
3. Today Basic Information Retrieval Approaches Bag of Words Assumption Information Retrieval Models Boolean model Vector-space model Topic/Language models
4. Computer-based Information Management Basic problem How to use computers to help humans store, organize and retrieve information? What approaches have been taken and what has been successful?
5. Three Major Approaches Database approach Expert-system approach Information retrieval approach
6. Database Approach Information is stored in a highly structured way Data stored in relational tables as tuples Simple data model and query language Relational model and SQL query language Clear interpretation of data and query No ambition to be “intelligent” like humans Main focus on system efficiency Pros and cons?
7. Expert-System Approach Information stored as a set of logical predicates Bird(X). Cat(X). Fly(X). Given a query, the system infers the answer through logical inference Given Bird(Ostrich), does Fly(Ostrich) hold? Pros and cons?
8. Information Retrieval Approach Uses existing text documents as information source No special structuring or database construction required Text-based query language Keyword-based query or natural language query Pros and cons?
9. Database vs. Information Retrieval
What we’re retrieving: IR deals with mostly unstructured free text with some metadata; databases hold structured data with clear semantics based on a formal model.
Queries we’re posing: IR queries are vague, imprecise information needs (often expressed in natural language); database queries are formally (mathematically) defined and unambiguous.
Results we get: IR results are sometimes relevant, often not; database results are exact and always correct in a formal sense.
Interaction with system: in IR, interaction is important; databases handle one-shot queries.
Other issues: IR downplays concurrency, recovery, and atomicity; in databases they are all critical.
10. Main Challenge of Information Retrieval Approach Interpretation of query and data is not straightforward Both queries and data are fuzzy Unstructured text and natural language query
11. Need for an Information Retrieval Model Computers do not understand the document or the query To implement the information retrieval approach, a model that a computer can execute is essential
12. What is a Model? A model is a construct designed to help us understand a complex system A particular way of “looking at things” Models work by adopting simplifying assumptions Different types of models Conceptual model Physical analog model Mathematical model …
13. Major Simplification: Bag of Words Consider each document as a “bag of words” Consider queries as a “bag of words” as well A great oversimplification, but it works adequately in many cases Still, how do we match documents and queries?
14. Sample Document
“Bag of Words”: 16 × said; 14 × McDonalds; 12 × fat; 11 × fries; 8 × new; 6 × company, french, nutrition; 5 × food, oil, percent, reduce, taste, Tuesday; …
McDonald's slims down spuds Fast-food chain to reduce certain types of fat in its french fries with new cooking oil. NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. …
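The word counts on this slide can be produced mechanically. The sketch below is an illustrative (not the lecture's own) implementation: it lower-cases the text, extracts word tokens with a simple regular expression, and counts them, discarding all order and structure, which is exactly the bag-of-words simplification.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Lower-case the text, extract word tokens, and count term frequencies."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

bag = bag_of_words("McDonald's slims down spuds. McDonald's is cutting fat, "
                   "the fast-food chain said Tuesday.")
# Word order and sentence structure are gone; only per-term counts remain.
```

Hyphenated words like "fast-food" split into two tokens here; a production tokenizer would make a more deliberate choice.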
18. A vector is a set of values recorded in any consistent order
“The quick brown fox jumped over the lazy dog’s back” → [ 1 1 1 1 1 1 1 1 2 ]
The positions correspond, in order, to the terms “back”, “brown”, “dog”, “fox”, “jump”, “lazy”, “over”, “quick”, and “the” (which occurs twice, hence the 2 in the 9th position)
19. Representing Documents
Document 1: The quick brown fox jumped over the lazy dog’s back.
Document 2: Now is the time for all good men to come to the aid of their party.
Stopword list: for, is, of, the, to

Term    Doc 1  Doc 2
aid       0      1
all       0      1
back      1      0
brown     1      0
come      0      1
dog       1      0
fox       1      0
good      0      1
jump      1      0
lazy      1      0
men       0      1
now       0      1
over      1      0
party     0      1
quick     1      0
their     0      1
time      0      1
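Building such a term-document incidence table can be sketched in a few lines. This is an illustrative implementation, not the lecture's: it tokenizes by whitespace, drops the slide's stopwords, and records a 0/1 entry per term per document. Note it skips stemming ("jumped" → "jump", "dog's" → "dog"), which the slide's table applies.

```python
d1 = "The quick brown fox jumped over the lazy dog's back"
d2 = "Now is the time for all good men to come to the aid of their party"
STOPWORDS = {"for", "is", "of", "the", "to"}  # the slide's stopword list

def index_terms(text):
    # Lower-case, split on whitespace, and drop stopwords. A real indexer
    # would also stem and strip punctuation; this sketch omits both.
    return {w for w in text.lower().split() if w not in STOPWORDS}

vocab = sorted(index_terms(d1) | index_terms(d2))
# One 0/1 incidence vector per term: [present in d1?, present in d2?]
matrix = {t: [int(t in index_terms(d1)), int(t in index_terms(d2))]
          for t in vocab}
```

Each column of the slide's table is one document's binary vector; each row is one term's incidence across the collection.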
20. Boolean Model Return all documents that contain the words in the query Weights assigned to terms are either “0” or “1” “0” represents “absence”: term isn’t in the document “1” represents “presence”: term is in the document No notion of “ranking” A document is either a match or a non-match
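With 0/1 weights, Boolean AND retrieval reduces to set containment: a document matches exactly when it contains every query term. A minimal sketch (the document collection here is invented for illustration):

```python
# Toy collection: document id -> set of index terms (the 0/1 weights
# of the Boolean model are equivalent to set membership).
docs = {
    1: {"quick", "brown", "fox"},
    2: {"good", "men", "party"},
    3: {"quick", "party"},
}

def boolean_and(query_terms, docs):
    """Return ids of documents containing ALL query terms; no ranking."""
    return {doc_id for doc_id, terms in docs.items()
            if query_terms <= terms}  # subset test = conjunctive match

boolean_and({"quick", "party"}, docs)  # only document 3 has both terms
```

Every match is equally good here; the model gives no way to say document 3 is "better" than another match, which motivates ranked retrieval below.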
21. Evaluation: Precision and Recall Q: Are all matching documents what users want? Basic idea: a model is good if it returns a document if and only if it is relevant Let R be the set of relevant documents and D the set of documents returned by a model Precision = |R ∩ D| / |D|, the fraction of returned documents that are relevant Recall = |R ∩ D| / |R|, the fraction of relevant documents that are returned
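These two definitions translate directly into set arithmetic. A small worked sketch (the document ids are made up for illustration):

```python
def precision_recall(relevant, returned):
    """Precision = |R ∩ D| / |D|; Recall = |R ∩ D| / |R|."""
    hits = len(relevant & returned)
    return hits / len(returned), hits / len(relevant)

# 4 relevant documents exist; the system returns 3 documents, 2 of them relevant.
p, r = precision_recall(relevant={1, 2, 3, 4}, returned={2, 3, 5})
# precision = 2/3 (two of three returned are relevant)
# recall    = 1/2 (two of four relevant were found)
```

The "if and only if" ideal on the slide corresponds to precision = recall = 1.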
22. Ranked Retrieval Order documents by how likely they are to be relevant to the information need Arranging documents by relevance is Closer to how humans think: some documents are “better” than others Closer to user behavior: users can decide when to stop reading Best (partial) match: documents need not have all query terms Although documents with more query terms should be “better”
23. Similarity-Based Queries Let’s replace relevance with “similarity” Rank documents by their similarity with the query Treat the query as if it were a document Create a query bag-of-words Find its similarity to each document Rank order the documents by similarity Surprisingly, this works pretty well!
24. Vector Space Model Documents “close together” in vector space “talk about” the same things Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)
[Figure: documents d1–d5 plotted as vectors in term space with axes t1, t2, t3; angles θ and φ between vectors measure closeness]
33. Normalizing Document Vectors Recall our similarity function: sim(q, d) = (q · d) / (|q| |d|) Normalize document vectors in advance Use the “cosine normalization” method: divide each term weight by the length (Euclidean norm) of the document vector
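Cosine similarity, the dot product divided by the two vector lengths, can be sketched over sparse term-weight vectors represented as dicts. This is an illustrative implementation with made-up query and document vectors:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

query = {"quick": 1, "fox": 1}
doc = {"quick": 1, "brown": 1, "fox": 1, "jump": 1}
score = cosine(query, doc)  # 2 / (sqrt(2) * 2) = 1/sqrt(2)
```

Dividing every document vector by its own length ahead of time ("cosine normalization") turns the query-time computation into a plain dot product, which is why it is done in advance.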
36. Language Models Language models Based on the notion of probabilities and processes for generating text Documents are ranked based on the probability that they generated the query Best/partial match
37. What is a Language Model? Probability distribution over strings of text How likely is a string in a given “language”? Probabilities depend on what language we’re modeling p1 = P(“a quick brown dog”) p2 = P(“dog quick a brown”) p3 = P(“быстрая brown dog”) p4 = P(“быстрая собака”) In a language model for English: p1 > p2 > p3 > p4 In a language model for Russian: p1 < p2 < p3 < p4
39. Think of all the things that have ever been said or will ever be said, of any length
47. The probability of a string given a model: the probability of a sequence of words decomposes into a product of the probabilities of the individual words, P(w1 w2 … wn | M) = P(w1 | M) × P(w2 | M) × ⋯ × P(wn | M)
49. An Example
Model M: P(the) = 0.2, P(a) = 0.1, P(man) = 0.01, P(woman) = 0.01, P(said) = 0.03, P(likes) = 0.02, …
P(“the man likes the woman” | M) = P(the|M) × P(man|M) × P(likes|M) × P(the|M) × P(woman|M)
= 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008
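The slide's calculation can be reproduced with a unigram query-likelihood sketch: store the model as a word-to-probability dict and multiply per-word probabilities, which assumes each word is generated independently.

```python
# The slide's unigram model M
model = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01,
         "said": 0.03, "likes": 0.02}

def query_likelihood(query, model):
    """P(query | M) under the unigram independence assumption."""
    p = 1.0
    for w in query.split():
        # An unseen word zeros out the whole score; real systems
        # apply smoothing to avoid this. This sketch does not.
        p *= model.get(w, 0.0)
    return p

query_likelihood("the man likes the woman", model)
# 0.2 * 0.01 * 0.02 * 0.2 * 0.01 = 8e-8, matching the slide
```

Ranking documents then amounts to building one such model per document and sorting by the query's likelihood under each.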
The database approach has been highly successful: all major businesses use a relational database system
The expert-system approach was popular in the 1980s, but it has not been successful for general information retrieval
The system returns the best-matching documents given the query; the approach had limited appeal until the Web became popular
What documents are good matches for a query?
What documents are good matches for a query? How do we represent the complexities of language, keeping in mind that computers don’t “understand” documents or queries? A simple, yet effective approach: “bag of words”. Treat all the words in a document as index terms for that document, assign a “weight” to each term based on its “importance”, and disregard order, structure, meaning, etc. of the words
Bag-of-words approach: information retrieval is all (and only) about matching words in documents with words in queries. Obviously not true, but it works pretty well!
Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance.
MapReduce + Cloud Computing Debate