
Week12

Text Mining


  1. Text Mining (IS698, Min Song)
  2. Example 1: KM People Finder. The needs: find people as well as documents that can address my information need; promote collaboration and knowledge sharing; leverage existing information access systems. The information sources: email, groupware, online reports, …
  3. Example 1: Simple KM People Finder. [Architecture diagram: a query goes to a search or navigation system, which returns relevant docs; a name extractor, checked against an authority list, turns these into ranked people names.]
  4. Example 1: KM People Finder
  5. Text Mining Definition. There are many definitions in the literature: "the non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data"; an exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge.
  6. Text Mining Definition. What is "previously unknown" information? Strict definition: information that not even the writer knows, e.g., discovering a new method for hair growth that is described as a side effect of a different procedure. Lenient definition: rediscovering information that the author encoded in the text, e.g., automatically extracting a product's name from a web page.
  7. Outline: text mining applications; text characteristics; text mining process; learning methods.
  8. Text Mining Applications. Marketing: discover distinct groups of potential buyers according to a user's text-based profile (e.g., Amazon). Industry: identify groups of competitors' web pages (e.g., competing products and their prices). Job seeking: identify parameters in searching for jobs (e.g., www.flipdog.com).
  9. Text Mining Methods. Information retrieval: indexing and retrieval of textual documents. Information extraction: extraction of partial knowledge in the text. Web mining: indexing and retrieval of textual documents and extraction of partial knowledge using the web. Clustering: generating collections of similar text documents.
  10. Information Retrieval. Given: a source of textual documents and a user query (text based), e.g., "spam". Find: a (ranked) set of documents that are relevant to the query. [Diagram: the query and the documents source feed an IR system, which outputs ranked documents.]
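
A minimal sketch of the retrieval step on slide 10, assuming scikit-learn is available; the documents and the query are made-up, and cosine similarity over TF-IDF vectors stands in for whatever ranking function a real IR system uses:

```python
# A toy IR pipeline: index a small document source, then return documents
# ranked by cosine similarity to the query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "how to recognize spam email",
    "football match results from England",
    "filtering spam with machine learning",
]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)            # index the source
query_vector = vectorizer.transform(["spam filtering"])
scores = cosine_similarity(query_vector, doc_vectors)[0]

# Rank documents by relevance score, best first
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```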
  11. Intelligent Information Retrieval. Meaning of words: synonyms ("buy" / "purchase"); ambiguity ("bat": baseball vs. mammal). Order of words in the query: "hot dog stand in the amusement park" vs. "hot amusement stand in the dog park". User dependency for the data: direct feedback; indirect feedback. Authority of the source: IBM is more likely to be an authoritative source than my distant second cousin.
  12. What is Information Extraction? Given: a source of textual documents and a well-defined, limited query (text based). Find: sentences with relevant information; extract the relevant information and ignore non-relevant information (important!); link related information and output it in a predetermined format.
  13. Information Extraction: Example. "Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime. … Garcia Alvarado, 56, was killed when a bomb placed by urban guerrillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador. … According to the police and Garcia Alvarado's driver, who escaped unscathed, the attorney general was traveling with two bodyguards. One of them was injured." Extracted template: Incident Date: 19 Apr 89; Incident Type: Bombing; Perpetrator Individual ID: "urban guerrillas"; Human Target Name: "Roberto Garcia Alvarado"; ...
  14. What is Information Extraction? [Diagram: a documents source feeds an extraction system driven by several queries, e.g., query 1 (job title) and query 2 (salary); the relevant pieces of information (relevant info 1, 2, 3) are extracted and the query results combined into ranked documents.]
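
One lightweight way to realize the extraction step sketched on slide 14 is pattern matching that fills a predetermined template. The job posting, field names, and regular expressions below are illustrative assumptions, not part of the slides:

```python
# Pattern-based extraction into a predetermined template. Real IE systems
# use richer linguistic features than these toy regular expressions.
import re

posting = "Job title: Data Analyst. Salary: $75,000 per year. Location: Boston."

matches = {
    "job_title": re.search(r"Job title:\s*([^.]+)", posting),
    "salary":    re.search(r"Salary:\s*\$([\d,]+)", posting),
}
record = {field: (m.group(1).strip() if m else None) for field, m in matches.items()}
print(record)   # {'job_title': 'Data Analyst', 'salary': '75,000'}
```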
  15. Why Mine the Web? Enormous wealth of textual information on the Web: book/CD/video stores (e.g., Amazon); restaurant information (e.g., Zagats); car prices (e.g., Carpoint). Lots of data on user access patterns: web logs contain the sequences of URLs accessed by users. It is possible to retrieve "previously unknown" information: people who ski also frequently break their legs; restaurants that serve seafood in California are likely to be outside San Francisco.
  16. Mining the Web. [Diagram: a web spider collects the documents source; a query feeds an IR/IE system, which outputs ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, …]
  17. Unique Features of the Web. The Web is a huge collection of documents, many of which contain hyper-link information and access and usage information. The Web is very dynamic: web pages are constantly being generated (and removed). Challenge: develop new Web mining algorithms that exploit hyper-links and access patterns and are adaptable to their documents source.
  18. Intelligent Web Search. Combine the intelligent IR tools (meaning of words, order of words in the query, user dependency for the data, authority of the source) with the unique web features (retrieve hyper-link information, utilize hyper-links as input).
  19. What is Clustering? Given: a source of textual documents and a similarity measure, e.g., how many words are common in these documents. Find: several clusters of documents that are relevant to each other. [Diagram: the documents source and the similarity measure feed a clustering system, which groups the documents into clusters.]
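
The word-overlap similarity mentioned on slide 19 can be made concrete as a Jaccard coefficient over word sets; a small sketch (the helper name `word_overlap` is mine):

```python
# Word-overlap similarity between two documents: size of the shared word set
# divided by the size of the combined word set. Purely illustrative.
def word_overlap(doc_a: str, doc_b: str) -> float:
    words_a, words_b = set(doc_a.lower().split()), set(doc_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

print(word_overlap("the football fan", "the italian football fan"))  # 0.75
```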
  20. Outline: text mining applications; text characteristics; text mining process; learning methods.
  21. Text characteristics: Outline. Large textual databases; high dimensionality; several input modes; dependency; ambiguity; noisy data; not well structured text.
  22. Text characteristics. Large textual databases: efficiency considerations; over 2,000,000,000 web pages; almost all publications are also in electronic form. High dimensionality (sparse input): consider each word/phrase as a dimension. Several input modes: e.g., in web mining, information about a user is generated by semantics, browse patterns, and outside knowledge bases.
  23. Text characteristics. Dependency: relevant information is a complex conjunction of words/phrases, e.g., document categorization, pronoun disambiguation. Ambiguity: word ambiguity (pronouns: he, she, …; "buy" / "purchase") and semantic ambiguity ("The king saw the rabbit with his glasses.").
  24. Text characteristics. Noisy data: e.g., spelling mistakes. Not well structured text: chat rooms ("r u available ?", "Hey whazzzzzz up"); speech.
  25. Outline: text mining applications; text characteristics; text mining process; learning methods.
  26. Text mining process
  27. Text mining process. Text preprocessing: syntactic/semantic text analysis. Feature generation: bag of words. Feature selection: simple counting; statistics. Text/data mining: classification (supervised learning); clustering (unsupervised learning). Analyzing results.
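
A sketch of the feature generation step from slide 27 (bag of words by simple counting), with made-up documents:

```python
# Bag-of-words features: each distinct word becomes a dimension, and each
# document becomes a vector of word counts over the shared vocabulary.
from collections import Counter

docs = ["the game in London", "the football game"]
vocabulary = sorted({w for d in docs for w in d.lower().split()})

for doc in docs:
    counts = Counter(doc.lower().split())
    vector = [counts[w] for w in vocabulary]   # Counter returns 0 for absent words
    print(vector, doc)
```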
  28. Syntactic/Semantic text analysis. Part-of-speech (POS) tagging: find the corresponding POS for each word, e.g., John (noun) gave (verb) the (det) ball (noun); ~98% accurate. Word sense disambiguation: context based or proximity based; very accurate. Parsing: generates a parse tree (graph) for each sentence; each sentence is a stand-alone graph.
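
A quick POS-tagging sketch matching the example on slide 28, assuming NLTK and its default tagger (the tokenizer and tagger models must be downloaded once; model names may vary across NLTK versions):

```python
# POS tagging with NLTK's default tagger.
import nltk
# One-time setup (assumed model names):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("John gave the ball")
print(nltk.pos_tag(tokens))
# [('John', 'NNP'), ('gave', 'VBD'), ('the', 'DT'), ('ball', 'NN')]
```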
  29. Text Mining: Classification definition. Given: a collection of labeled records (training set); each record contains a set of features (attributes) and the true class (label). Find: a model for the class as a function of the values of the features. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model; usually the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
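
The train/test protocol from slide 29, sketched with scikit-learn; the model choice (naive Bayes over word counts) is an assumption for illustration, not something the slides prescribe, and the tiny data set loosely echoes the hooligan example:

```python
# Split labeled records into training and test sets, fit a model on the
# training set, and measure accuracy on the held-out test set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

texts = ["An English football fan", "During a game in Italy",
         "Italian football fans were cheering", "An average USA salesman"]
labels = ["Yes", "Yes", "No", "No"]

X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=0)

model = MultinomialNB().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```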
  30. Text Mining: Clustering definition. Given: a set of documents and a similarity measure among documents. Find: clusters such that documents in one cluster are more similar to one another, while documents in separate clusters are less similar to one another. Goal: finding a correct set of documents. Similarity measures: Euclidean distance if attributes are continuous; other problem-specific measures, e.g., how many words are common in these documents.
  31. Supervised vs. Unsupervised Learning. Supervised learning (classification): supervision means the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations; new data is classified based on the training set. Unsupervised learning (clustering): the class labels of the training data are unknown; given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
  32. Evaluation: What Is Good Classification? Correct classification: the known label of a test sample is identical to the class result from the classification model. Accuracy ratio: the percentage of test set samples that are correctly classified by the model. A distance measure between classes can be used: e.g., classifying a "football" document as a "basketball" document is not as bad as classifying it as "crime".
  33. Evaluation: What Is Good Clustering? A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
  34. Outline: text mining applications; text characteristics; text mining process; learning methods (classification; clustering).
  35. Classification: An Example. A training set (Country and Marital Status are categorical attributes, Income is continuous, Hooligan is the class) is used to learn a model, which the classifier then applies to the test set.
      Training set:
      Ex#  Country  Marital Status  Income  Hooligan
      1    England  Single          125K    Yes
      2    England  Married         100K    Yes
      3    England  Single          70K     Yes
      4    Italy    Married         40K     No
      5    USA      Divorced        95K     No
      6    England  Married         60K     Yes
      7    England  Divorced        20K     Yes
      8    Italy    Single          85K     Yes
      9    France   Married         75K     No
      10   Denmark  Single          50K     No
      Test set:
      Country  Marital Status  Income  Hooligan
      England  Single          75K     ?
      Turkey   Married         50K     ?
      England  Married         150K    ?
               Divorced        90K     ?
               Single          40K     ?
      Italy    Married         80K     ?
  36. Text Classification: An Example. The same protocol with text: learn a model from labeled documents, then classify the test documents.
      Training set (free text; Hooligan is the class):
      Ex#  Text                                                Hooligan
      1    An English football fan …                           Yes
      2    During a game in Italy …                            Yes
      3    England has been beating France …                   Yes
      4    Italian football fans were cheering …               No
      5    An average USA salesman earns 75K                   No
      6    The game in London was horrific                     Yes
      7    Manchester city is likely to win the championship   Yes
      8    Rome is taking the lead in the football league      Yes
      Test set:
      Text                                                Hooligan
      A Danish football fan                               ?
      Turkey is playing vs. France. The Turkish fans …    ?
  37. Classification Techniques: instance-based methods; decision trees; neural networks; Bayesian classification.
  38. Instance-based Methods. Instance-based (memory-based) learning: store the training examples and delay the processing ("lazy evaluation") until a new instance must be classified. The k-nearest neighbor approach: instances (examples) are represented as points in a Euclidean space.
  39. Text Examples in Euclidean Space. [Diagram: two documents plotted on "football" and "Italian" axes: "The English football fan is a hooligan …" scores high on football only, while "Similar to his English equivalent, the Italian football fan is a hooligan …" scores high on both.]
  40. K-Nearest Neighbor Algorithm. All instances correspond to points in the n-D space; the nearest neighbors are defined in terms of Euclidean distance. The k-NN returns the most common value among the k nearest training examples. Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples. [Diagram: "+" and "−" training points with an unlabeled query point "?".]
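
A minimal k-NN sketch matching slide 40: points in Euclidean space, majority vote among the k nearest training examples; the coordinates and labels are invented:

```python
# k-nearest neighbor: find the k training points closest to the query point
# (Euclidean distance) and return the most common label among them.
from collections import Counter
import math

train = [((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((5.0, 5.0), "-"), ((6.0, 5.5), "-")]

def knn_predict(point, k=3):
    nearest = sorted(train, key=lambda ex: math.dist(point, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((2.0, 2.0)))   # '+': two of its three nearest neighbors are '+'
```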
  41. Classification Techniques: instance-based methods; decision trees; neural networks; Bayesian classification.
  42. Decision Tree: An Example. From the training set (Country and Marital Status categorical, Income continuous, Hooligan the class), a tree is grown; the splitting attribute at each node is determined by a specific attribute selection algorithm.
      Ex#  Country  Marital Status  Income  Hooligan
      1    England  Single          125K    Yes
      2    England  Married         100K    Yes
      3    England  Single          70K     Yes
      4    Italy    Married         40K     No
      5    USA      Divorced        95K     No
      6    England  Married         60K     Yes
      7    England  Divorced        20K     Yes
      8    Italy    Single          85K     Yes
      9    France   Married         75K     No
      10   Denmark  Single          50K     No
      [Tree: English? Yes → YES; No → split on MarSt: Married → NO; Single or Divorced → split on Income: > 80K → YES, < 80K → NO.]
  43. Decision Tree: A Text Example. The same idea applied to the text training set from slide 36 (examples 1-8, "An English football fan …" through "Rome is taking the lead in the football league", with Hooligan as the class); again the splitting attribute at each node is determined by a specific attribute selection algorithm. [The slide shows the same tree sketch as slide 42.]
  44. Classification by DT Induction. A decision tree is a flow-chart-like tree structure: an internal node denotes a test on an attribute, a branch represents an outcome of the test, and leaf nodes represent class labels or class distributions. Decision tree generation consists of two phases: tree construction, and tree pruning (identify and remove branches that reflect noise or outliers). Use of a decision tree, classifying an unknown sample: test the attributes of the sample against the decision tree.
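
A sketch of tree induction on data shaped like the hooligan example, assuming scikit-learn and pandas; one-hot encoding of the categorical attributes is my choice, and the learned tree need not match the hand-drawn one on slide 42:

```python
# Fit a decision tree on a tiny table of categorical + continuous attributes
# and print the learned rules.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "country": ["England", "England", "Italy", "USA", "England", "Italy"],
    "marst":   ["Single", "Married", "Married", "Divorced", "Married", "Single"],
    "income":  [125, 100, 40, 95, 60, 85],
    "hooligan": ["Yes", "Yes", "No", "No", "Yes", "Yes"],
})
# One-hot encode the categorical attributes, keep income as-is
X = pd.get_dummies(data[["country", "marst"]]).assign(income=data["income"])
tree = DecisionTreeClassifier().fit(X, data["hooligan"])
print(export_text(tree, feature_names=list(X.columns)))
```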
  45. Clustering Techniques: partitioning methods; hierarchical methods.
  46. Partitioning Algorithms. Partitioning method: construct a partition of n documents into a set of k clusters. Given: a set of documents and the number k. Find: a partition of k clusters that optimizes the chosen partitioning criterion. Global optimum: exhaustively enumerate all partitions. Heuristic methods: the k-means and k-medoids algorithms. k-means: each cluster is represented by the center of the cluster.
  47. The K-means Clustering Method. The k-means algorithm is implemented in 4 steps: 1. partition the objects into k nonempty subsets; 2. compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster); 3. assign each object to the cluster with the nearest seed point; 4. go back to step 2, and stop when there are no more new assignments.
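
The four steps of slide 47, written out directly; a sketch assuming NumPy, with illustrative points:

```python
# k-means: alternate between assigning points to the nearest centroid and
# recomputing centroids as cluster means, until assignments stabilize.
import numpy as np

def kmeans(points, k, iters=100):
    # Step 1: choose k initial seed points (here simply k of the objects).
    centroids = points[:k].copy()
    for _ in range(iters):
        # Step 3: assign each object to the cluster with the nearest seed point.
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute seed points as the centroids (mean points).
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when nothing changes anymore.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

pts = np.array([[1.0, 1.0], [5.0, 5.0], [1.2, 0.8], [5.2, 4.8]])
print(kmeans(pts, k=2)[0])   # [0 1 0 1]: the two tight groups are recovered
```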
  48. The K-means Clustering: Example. [Series of scatter plots on 0-10 axes showing the objects being reassigned to the nearest centroid and the centroids being recomputed over successive iterations, until the clusters stabilize.]
  49. Clustering Techniques: partitioning methods; hierarchical methods.
  50. Hierarchical Clustering. Agglomerative: start with each document being a single cluster; eventually all documents belong to the same cluster. Divisive: start with all documents belonging to the same cluster; eventually each node forms a cluster of its own. Hierarchical clustering does not require the number of clusters k in advance, but it needs a termination condition. The final state in both the agglomerative and the divisive variant is of no use.
  51. Hierarchical Clustering: Example. [Diagram over objects a, b, c, d, e: agglomerative clustering merges a with b and d with e, then c with (d, e), then everything into (a, b, c, d, e) over steps 0-4; divisive clustering performs the same splits in reverse, from step 4 back to step 0.]
  52. A Dendrogram: Hierarchical Clustering. A dendrogram decomposes data objects into several levels of nested partitioning (a tree of clusters). A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
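
A sketch of building and cutting a dendrogram, assuming SciPy's hierarchical clustering routines; the five points stand in for the objects a-e from slide 51:

```python
# Agglomerative clustering: build the merge tree (dendrogram), then "cut" it
# at the level that yields the desired number of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [5.3, 5.2]])
Z = linkage(points, method="average")            # the merge tree
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
print(labels)   # e.g., [1 1 2 2 2]
```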
  53. Demo
  54. Commercial Tools: IBM Intelligent Miner for Text; Semio Map; InXight LinguistX / ThingFinder; LexiQuest; ClearForest; Teragram; SRA NetOwl Extractor; Autonomy.
  55. Summary. Text is tricky to process, but "ok" results are easily achieved. There exist several text mining systems, e.g., D2K (Data to Knowledge), http://www.ncsa.uiuc.edu/Divisions/DMV/ALG/. Additional intelligence can be integrated with text mining, and one may experiment with any phase of the text mining process.
  56. Summary. There are many other scientific and statistical text mining methods developed but not covered in this talk: see http://www.cs.utexas.edu/users/pebronia/text-mining/ and http://filebox.vt.edu/users/wfan/text_mining.html. It is also important to study the theoretical foundations of data mining: Data Mining: Concepts and Techniques, J. Han & M. Kamber; Machine Learning, T. Mitchell.
