2. The Needs:
- Find people as well as documents that can
address my information need.
- Promote collaboration and knowledge
sharing
- Leverage existing information access
system
- The Information Sources:
- Email, groupware, online reports, …
Example 1: KM People Finder
3. Example 1:
Simple KM People Finder
[Diagram: Query, Search or Navigation System, Relevant Docs, Name Extractor, Authority List, Ranked People Names]
5. • An exploration and analysis of textual (natural-language) data by
automatic and semi-automatic means to discover new knowledge.
Text Mining Definition
Many definitions in the literature
“The non-trivial extraction of implicit, previously unknown,
and potentially useful information from (large amounts of)
textual data”.
6. What is “previously unknown”
information?
Strict definition
Information that not even the writer knows.
e.g., Discovering a new method for hair growth
that is described as a side effect of a different
procedure
Lenient definition
Rediscover the information that the author
encoded in the text
e.g., Automatically extracting a product’s name
from a web-page.
Text Mining Definition
7. Outline
Text mining applications
Text characteristics
Text mining process
Learning methods
8. Text Mining Applications
Marketing: Discover distinct
groups of potential buyers
according to a user text based
profile
e.g. amazon
Industry: Identifying groups
of competitors' web pages
e.g., competing products
and their prices
Job seeking: Identify
parameters in searching for
jobs
e.g., www.flipdog.com
9. Information Retrieval
Indexing and retrieval of textual documents
Information Extraction
Extraction of partial knowledge in the text
Web Mining
Indexing and retrieval of textual documents
and extraction of partial knowledge using the
web
Clustering
Generating collections of similar text
documents
Text Mining Methods
10. Information Retrieval
Given:
A source of textual
documents
A user query (text
based)
[Diagram: Query and Documents source (e.g. Spam / Text) as inputs to the IR System]
• Find:
• A set (ranked) of documents that
are relevant to the query
[Diagram: IR System output, a ranked list of Documents]
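A minimal sketch of the Given/Find setup above, assuming TF-IDF weighting with cosine similarity as the relevance score. The slides do not name a scoring method, so this choice, the function names and the sample documents are illustrative assumptions only.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; real systems also handle punctuation, stemming, etc.
    return text.lower().split()

def rank(query, docs):
    """Return (score, doc) pairs sorted by TF-IDF cosine similarity to the query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def vector(tokens):
        tf = Counter(tokens)
        # Terms never seen in the collection are ignored.
        return {t: tf[t] * idf[t] for t in tf if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vector(tokenize(query))
    scored = [(cosine(q, vector(toks)), doc) for toks, doc in zip(tokenized, docs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

docs = [
    "football fans were cheering in Rome",
    "car prices rose again this year",
    "the football league game in London",
]
ranked = rank("football game", docs)
```

The document sharing both query terms ranks first, and the document sharing none gets a score of zero.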
11. Intelligent Information Retrieval
meaning of words
Synonyms “buy” / “purchase”
Ambiguity “bat” (baseball vs. mammal)
order of words in the query
hot dog stand in the amusement park
hot amusement stand in the dog park
user dependency for the data
direct feedback
indirect feedback
authority of the source
IBM is more likely to be an authoritative source than
my distant cousin
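A small sketch of why word order matters, using the two queries from this slide. The helper names are hypothetical. A bag-of-words view cannot tell the queries apart, while adjacent word pairs (bigrams) can.

```python
from collections import Counter

q1 = "hot dog stand in the amusement park"
q2 = "hot amusement stand in the dog park"

def bag_of_words(text):
    # Word counts only; all ordering information is discarded.
    return Counter(text.lower().split())

def bigrams(text):
    # Set of adjacent word pairs; keeps local word order.
    toks = text.lower().split()
    return set(zip(toks, toks[1:]))

same_bag = bag_of_words(q1) == bag_of_words(q2)
shared_bigrams = bigrams(q1) & bigrams(q2)
```

Here `same_bag` is true, yet the two queries share only the bigrams ("stand", "in") and ("in", "the").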
12. Given:
A source of textual documents
A well defined limited query (text based)
Find:
Sentences with relevant information
Extract the relevant information and
ignore non-relevant information
(important!)
Link related information and output in a
predetermined format
What is Information Extraction?
13. Information Extraction: Example
Salvadoran President-elect Alfredo Cristiani condemned the
terrorist killing of Attorney General Roberto Garcia Alvarado
and accused the Farabundo Marti National Liberation Front
(FMLN) of the crime. … Garcia Alvarado, 56, was killed when
a bomb placed by urban guerillas on his vehicle exploded as
it came to a halt at an intersection in downtown San
Salvador. … According to the police and Garcia Alvarado’s
driver, who escaped unscathed, the attorney general was
traveling with two bodyguards. One of them was injured.
Incident Date: 19 Apr 89
Incident Type: Bombing
Perpetrator Individual ID: “urban guerillas”
Human Target Name: “Roberto Garcia Alvarado”
...
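A rough sketch of filling part of the template above with hand-written patterns. The regular expressions are assumptions written for this one passage, not a general extraction system, and the incident date is left out because it does not appear in the excerpt.

```python
import re

text = (
    "Salvadoran President-elect Alfredo Cristiani condemned the terrorist "
    "killing of Attorney General Roberto Garcia Alvarado and accused the "
    "Farabundo Marti National Liberation Front (FMLN) of the crime. "
    "Garcia Alvarado, 56, was killed when a bomb placed by urban guerillas "
    "on his vehicle exploded as it came to a halt at an intersection."
)

def extract_incident(text):
    """Fill a fixed-format template from free text; slots with no match stay absent."""
    template = {}
    if re.search(r"\bbomb\b", text):
        template["Incident Type"] = "Bombing"
    # Capitalized words that follow the title are taken as the target's name.
    target = re.search(
        r"killing of Attorney General ([A-Z][a-z]+(?: [A-Z][a-z]+)*)", text)
    if target:
        template["Human Target Name"] = target.group(1)
    # Lowercase words between "placed by" and "on" are taken as the perpetrator.
    perp = re.search(r"placed by ([a-z]+(?: [a-z]+)*?) on ", text)
    if perp:
        template["Perpetrator Individual ID"] = perp.group(1)
    return template

incident = extract_incident(text)
```

Everything outside the three patterns is ignored, which mirrors the "ignore non-relevant information" requirement.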
14. What is Information Extraction?
[Diagram: Extraction System, Documents source, Ranked Documents with Relevant Info 1, 2 and 3, Query 1 (e.g. job title), Query 2 (e.g. salary), Combine Query Results]
15. Why Mine the Web?
Enormous wealth of textual information on the Web.
Book/CD/Video stores (e.g., Amazon)
Restaurant information (e.g., Zagats)
Car prices (e.g., Carpoint)
Lots of data on user access patterns
Web logs contain sequences of URLs accessed by users
Possible to retrieve “previously unknown”
information
People who ski also frequently break their legs.
Restaurants that serve seafood in California are likely
to be outside San Francisco
16. Mining the Web
[Diagram: Web Spider, Documents source, Query, IR / IE System, Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]
17. The Web is a huge collection of
documents where many contain:
Hyper-link information
Access and usage information
The Web is very dynamic
Web pages are constantly being
generated (removed)
Unique Features of the Web
Challenge: Develop new Web mining algorithms to . . .
• Exploit hyper-links and access patterns.
• Be adaptable to the documents source
18. Combine the intelligent IR tools
meaning of words
order of words in the query
user dependency for the data
authority of the source
With the unique web features
retrieve Hyper-link information
utilize Hyper-link as input
Intelligent Web Search
19. What is Clustering?
Given:
A source of
textual documents
Similarity
measure
e.g., how many
words are
common in these
documents
[Diagram: Documents source and Similarity measure as inputs to the Clustering System, which outputs groups of Docs]
• Find:
• Several clusters of documents
that are relevant to each other
20. Outline
Text mining applications
Text characteristics
Text mining process
Learning methods
21. Text characteristics: Outline
Large textual data base
High dimensionality
Several input modes
Dependency
Ambiguity
Noisy data
Not well structured text
22. Text characteristics
Large textual data base
Efficiency consideration
over 2,000,000,000 web pages
almost all publications are also in electronic form
High dimensionality (Sparse input)
Consider each word/phrase as a dimension
Several input modes
e.g., Web mining: information about the user is
generated from semantics, browsing patterns and an outside
knowledge base.
23. Text characteristics
Dependency
relevant information is a complex conjunction of
words/phrases
e.g., Document categorization.
Pronoun disambiguation.
Ambiguity
Word ambiguity
Pronouns (he, she …)
“buy”, “purchase”
Semantic ambiguity
The king saw the rabbit with his glasses.
24. Text characteristics
Noisy data
Example: Spelling mistakes
Not well structured text
Chat rooms
“r u available ?”
“Hey whazzzzzz up”
Speech
25. Outline
Text mining applications
Text characteristics
Text mining process
Learning methods
27. Text mining process
Text preprocessing
Syntactic/Semantic
text analysis
Features Generation
Bag of words
Features Selection
Simple counting
Statistics
Text/Data Mining
Classification-
Supervised learning
Clustering-
Unsupervised
learning
Analyzing results
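A minimal sketch of the two middle phases above: features generation as a bag of words, then features selection by simple counting. The `min_df` threshold and the function names are assumptions made for illustration.

```python
from collections import Counter

docs = [
    "An English football fan",
    "Italian football fans were cheering",
    "The game in London was horrific",
]

def tokenize(text):
    return text.lower().split()

# Features generation: one bag of words (term -> count) per document.
bags = [Counter(tokenize(d)) for d in docs]

def select_features(bags, min_df=2):
    """Features selection by simple counting: keep terms found in >= min_df documents."""
    # Iterating over a Counter yields each distinct term once per document.
    df = Counter(term for bag in bags for term in bag)
    return sorted(t for t, c in df.items() if c >= min_df)

vocab = select_features(bags, min_df=2)
# Each document becomes a vector of counts over the selected vocabulary.
vectors = [[bag[t] for t in vocab] for bag in bags]
```

No stemming is applied, so "fan" and "fans" stay separate terms and only "football" survives the threshold. These vectors are what a classifier or clustering step would then consume.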
28. Part-of-speech (POS) tagging
Find the corresponding POS for each word
e.g., John (noun) gave (verb) the (det) ball
(noun)
~98% accurate.
Word sense disambiguation
Context based or proximity based
Very accurate
Parsing
Generates a parse tree (graph) for each sentence
Each sentence is a stand-alone graph
Syntactic / Semantic text analysis
29. Given: a collection of labeled records (training set)
Each record contains a set of features (attributes),
and the true class (label)
Find: a model for the class as a function of the values
of the features
Goal: previously unseen records should be assigned a
class as accurately as possible
A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with the training set used to build
the model and the test set used to validate it
Text Mining: Classification definition
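A short sketch of dividing a labeled data set into training and test sets. The split fraction, the fixed seed and the sample records are assumptions for illustration.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the labeled records and split it into (training, test)."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Each record is (features, label); here the "features" are just raw text.
records = [
    ("An English football fan", "Yes"),
    ("Italian football fans were cheering", "No"),
    ("An average USA salesman earns 75K", "No"),
    ("The game in London was horrific", "Yes"),
    ("Manchester City is likely to win the championship", "Yes"),
]
training_set, test_set = train_test_split(records, test_fraction=0.4)
```

The model is built from `training_set` only; `test_set` is held back to estimate accuracy on unseen records.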
30. Similarity Measures:
• Euclidean Distance if attributes are continuous
• Other Problem-specific Measures
• e.g., how many words are common in these documents
Given: a set of documents and a similarity measure
among documents
Find: clusters such that:
Documents in one cluster are more similar to one
another
Documents in separate clusters are less similar to
one another
Goal:
Finding a correct set of documents
Text Mining: Clustering definition
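The two similarity measures named on this slide can be sketched as follows. The function names are hypothetical and the tokenization is a plain whitespace split.

```python
import math

def euclidean_distance(x, y):
    """Distance between two equal-length vectors of continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def common_words(doc_a, doc_b):
    """Problem-specific measure: number of distinct words the documents share."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))

d = euclidean_distance((0, 0), (3, 4))
shared = common_words("An English football fan",
                      "Italian football fans were cheering")
```

Note the direction of each measure: a smaller Euclidean distance means more similar, while a larger shared-word count means more similar.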
31. Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc.,
the aim is to establish the existence of
classes or clusters in the data
Supervised vs. Unsupervised Learning
32. Correct classification: The known label of the test
sample is identical to the class result from
the classification model
Accuracy ratio: the percentage of test set
samples that are correctly classified by the
model
A distance measure between classes can be
used
e.g., classifying “football” document as a
“basketball” document is not as bad as
classifying it as “crime”.
Evaluation: What Is Good Classification?
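A sketch of both evaluation ideas: the accuracy ratio, and an error score weighted by a distance between classes. The distance values in `CLASS_DISTANCE` are made-up numbers chosen only to reflect the football / basketball / crime example.

```python
def accuracy(true_labels, predicted):
    """Fraction of test samples whose predicted class equals the known label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted))
    return correct / len(true_labels)

# Hypothetical class distances: the two sports are close, "crime" is far from both.
CLASS_DISTANCE = {
    frozenset(["football", "basketball"]): 1,
    frozenset(["football", "crime"]): 3,
    frozenset(["basketball", "crime"]): 3,
}

def mean_class_distance(true_labels, predicted):
    """Average distance between known and predicted class (0 for a correct result)."""
    total = sum(0 if t == p else CLASS_DISTANCE[frozenset([t, p])]
                for t, p in zip(true_labels, predicted))
    return total / len(true_labels)

truth = ["football", "football", "crime", "basketball"]
model_a = ["basketball", "football", "crime", "basketball"]  # near miss
model_b = ["crime", "football", "crime", "basketball"]       # far miss
```

Both models have the same accuracy, but the distance-weighted score penalizes `model_b` more for calling a football document "crime".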
33. Good clustering method: produce
high quality clusters with . . .
high intra-class similarity
low inter-class similarity
The quality of a clustering method is
also measured by its ability to discover
some or all of the hidden patterns
Evaluation: What Is Good Clustering?
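One way to put numbers on intra-class and inter-class similarity, assuming Jaccard word overlap as the similarity measure. That choice and the toy clusters are assumptions; the slides do not fix a measure.

```python
from itertools import combinations

def jaccard(a, b):
    """Shared words divided by all distinct words of the two documents."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def mean(values):
    return sum(values) / len(values) if values else 0.0

def cluster_quality(clusters):
    """Return (mean intra-cluster similarity, mean inter-cluster similarity)."""
    intra = [jaccard(x, y) for c in clusters for x, y in combinations(c, 2)]
    inter = [jaccard(x, y)
             for c1, c2 in combinations(clusters, 2)
             for x in c1 for y in c2]
    return mean(intra), mean(inter)

clusters = [
    ["football fan", "football game"],
    ["car prices", "car dealer"],
]
intra, inter = cluster_quality(clusters)
```

A good clustering shows a high first value and a low second value.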
34. Outline
Text mining applications
Text characteristics
Text mining process
Learning methods
Classification
Clustering
35. Classification: An Example
Training Set:
Ex#  Country  Marital Status  Income   Hooligan
1    England  Single          125K     Yes
2    England  Married         (blank)  Yes
3    England  Single          70K      Yes
4    Italy    Married         40K      No
5    USA      Divorced        95K      No
6    England  Married         60K      Yes
7    England  (blank)         20K      Yes
8    Italy    Single          85K      Yes
9    France   Married         75K      No
10   Denmark  Single          50K      No
Attribute types: Country and Marital Status are categorical, Income is continuous, Hooligan is the class.
[Diagram: Training Set, Learn Classifier, Model, Test Set]
Test Set:
Country  Marital Status  Income  Hooligan
England  Single          75K     ?
Turkey   Married         50K     ?
England  Married         150K    ?
(blank)  Divorced        90K     ?
(blank)  Single          40K     ?
Italy    Married         80K     ?
36. Text Classification: An Example
Training Set (text, class):
Ex#  Text                                               Hooligan
1    An English football fan …                          Yes
2    During a game in Italy …                           Yes
3    England has been beating France …                  Yes
4    Italian football fans were cheering …              No
5    An average USA salesman earns 75K                  No
6    The game in London was horrific                    Yes
7    Manchester City is likely to win the championship  Yes
8    Rome is taking the lead in the football league     Yes
[Diagram: Training Set, Learn Classifier, Model, Test Set]
Test Set:
Text                                                    Hooligan
A Danish football fan                                   ?
Turkey is playing vs. France. The Turkish fans …        ?
38. Instance-based (memory based)
learning
Store training examples and delay the
processing (“lazy evaluation”) until a new
instance must be classified
k-nearest neighbor approach
Instances (Examples) are represented
as points in a Euclidean space
Instance-based Methods
39. [Figure: two example sentences plotted against the axes “football” and “Italian”: “The English football fan is a hooligan.” and “Similar to his English equivalent, the Italian football fan is a hooligan.”]
Text Examples in Euclidean
Space
40. All instances correspond to points in the n-D space
The nearest neighbors are defined in terms
of Euclidean distance
[Figure: scatter plot of training examples labeled + and _ around a query point ?]
• The k-NN returns the most common value among the k nearest training
examples
• Voronoi diagram: the decision surface induced by 1-NN for a typical set of
training examples
K-Nearest Neighbor Algorithm
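A compact sketch of k-NN as described above: store the training points, and at query time return the most common label among the k nearest ones by Euclidean distance. The 2-D points and labels are invented for illustration.

```python
import math
from collections import Counter

def knn_classify(query, training, k=3):
    """training is a list of (point, label); points are equal-length tuples."""
    # Lazy evaluation: all the work happens here, when a new instance arrives.
    neighbors = sorted(training, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [
    ((1, 1), "+"), ((1, 2), "+"), ((2, 1), "+"),
    ((5, 5), "-"), ((6, 5), "-"), ((5, 6), "-"),
]
near_plus = knn_classify((1.5, 1.5), training, k=3)
near_minus = knn_classify((5.5, 5.5), training, k=3)
```

An odd k avoids tied votes in the two-class case. With k = 1 the decision regions are exactly the Voronoi cells of the training points.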
42. Training data:
Ex#  Country  Marital Status  Income  Hooligan
1    England  Single          125K    Yes
2    England  Married         100K    Yes
3    England  Single          70K     Yes
4    Italy    Married         40K     No
5    USA      Divorced        95K     No
6    England  Married         60K     Yes
7    England  Divorced        20K     Yes
8    Italy    Single          85K     Yes
9    France   Married         75K     No
10   Denmark  Single          50K     No
Attribute types: Country and Marital Status are categorical, Income is continuous, Hooligan is the class.
Decision Tree: An Example
[Figure: decision tree with splitting attributes English (Yes / No), MarSt (Married / Single, Divorced) and Income (> 80K / < 80K); leaves are YES and NO]
The splitting attribute at a node is
determined based on a specific
Attribute selection algorithm
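The slide does not say which attribute selection algorithm is meant. As one common possibility, the sketch below computes information gain (entropy reduction) for the two categorical attributes of the table on this slide. The choice of criterion is an assumption.

```python
import math
from collections import Counter, defaultdict

# (Country, Marital Status, Hooligan) for the ten training examples.
ROWS = [
    ("England", "Single", "Yes"), ("England", "Married", "Yes"),
    ("England", "Single", "Yes"), ("Italy", "Married", "No"),
    ("USA", "Divorced", "No"), ("England", "Married", "Yes"),
    ("England", "Divorced", "Yes"), ("Italy", "Single", "Yes"),
    ("France", "Married", "No"), ("Denmark", "Single", "No"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index):
    """Entropy of the class minus the weighted entropy after splitting on one attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr_index]].append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder

gain_country = information_gain(ROWS, 0)
gain_marital = information_gain(ROWS, 1)
```

On this data Country yields the larger gain, so under this criterion it would be preferred over Marital Status for the first split.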
43. Training data (text, class):
Ex#  Text                                               Hooligan
1    An English football fan …                          Yes
2    During a game in Italy …                           Yes
3    England has been beating France …                  Yes
4    Italian football fans were cheering …              No
5    An average USA salesman earns 75K                  No
6    The game in London was horrific                    Yes
7    Manchester City is likely to win the championship  Yes
8    Rome is taking the lead in the football league     Yes
Decision Tree: A Text Example
[Figure: decision tree with splitting attributes English (Yes / No), MarSt (Married / Single, Divorced) and Income (> 80K / < 80K); leaves are YES and NO]
The splitting attribute at a node is
determined based on a specific
Attribute selection algorithm
44. Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases:
Tree construction
Tree pruning
Identify and remove branches that reflect noise or
outliers
Use of decision tree: Classifying an unknown sample
Test the attribute values of the sample against the decision
tree
Classification by DT Induction
46. Partitioning method: Construct a partition of n
documents into a set of k clusters
Given: a set of documents and the number k
Find: a partition of k clusters that optimizes the
chosen partitioning criterion
Global optimum: exhaustively enumerate all
partitions
Heuristic methods: k-means and k-medoids
algorithms
k-means: Each cluster is represented by the center
of the cluster
Partitioning Algorithms
47. The k-means algorithm is implemented in
4 steps:
1. Partition objects into k nonempty subsets.
2. Compute seed points as the centroids of the
clusters of the current partition. The
centroid is the center (mean point) of the
cluster.
3. Assign each object to the cluster with the
nearest seed point.
4. Go back to Step 2; stop when no assignment
changes.
The K-means Clustering Method
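The four steps above can be sketched directly. The round-robin initial partition, the `max_iter` cap and the sample points are assumptions; any partition into k nonempty subsets would do for Step 1.

```python
import math

def centroid(points):
    """Mean point of a nonempty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iter=100):
    # Step 1: partition objects into k nonempty subsets (needs len(points) >= k).
    assignment = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid of each cluster of the current partition.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # if a cluster empties, keep its previous centroid
                centroids[c] = centroid(members)
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop when no assignment changes, otherwise repeat from Step 2.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, centroids

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
assignment, centroids = k_means(points, k=2)
```

On these points the loop converges after two passes, separating the three points near (1, 1) from the three near (8, 8). In general the result depends on the initial partition.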
50. Agglomerative:
Start with each document being a single cluster.
Eventually all documents belong to the same
cluster.
Divisive:
Start with all documents belonging to the same
cluster.
Eventually each node forms a cluster on its own.
Does not require the number of clusters k in advance
Needs a termination condition
The final state in both agglomerative and divisive
clustering (one all-inclusive cluster, or all singletons) is of no use.
Hierarchical Clustering
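A minimal sketch of the agglomerative direction with an explicit termination condition. Single-link (closest pair) distance and the `stop_at` cluster count are assumptions; the slide does not specify a linkage or a stopping rule.

```python
import math
from itertools import combinations

def single_link(c1, c2):
    """Cluster distance = distance between the closest pair of members."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerative(points, stop_at=1):
    """Start with singleton clusters and merge the closest pair until stop_at remain."""
    clusters = [[p] for p in points]
    merges = []  # history of merges, i.e. the levels of the cluster tree
    while len(clusters) > stop_at:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges

points = [(0, 0), (0, 1), (10, 0), (10, 1), (10, 3)]
clusters, merges = agglomerative(points, stop_at=2)
```

Without the stopping rule the loop would always end in a single cluster holding everything, which is why a termination condition is needed.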
51. [Figure: objects a, b, c, d, e. Agglomerative (Step 0 to Step 4): merge a b, then d e, then c d e, then a b c d e. Divisive runs the same steps in reverse (Step 4 to Step 0).]
Hierarchical Clustering:
Example
52. • Dendrogram: Decomposes data
objects into several levels of
nested partitioning (tree of
clusters).
• Clustering of the data objects is
obtained by cutting the
dendrogram at the desired level,
then each connected component
forms a cluster.
A Dendrogram: Hierarchical
Clustering
54. Commercial Tools
IBM Intelligent Miner for Text
Semio Map
InXight LinguistX / ThingFinder
LexiQuest
ClearForest
Teragram
SRA NetOwl Extractor
Autonomy
55. Text is tricky to process, but “ok” results are easily
achieved
There exist several text mining systems
e.g., D2K - Data to Knowledge
http://www.ncsa.uiuc.edu/Divisions/DMV/ALG/
Additional Intelligence can be integrated with text
mining
One may play with any phase of the text mining
process
Summary
56. Summary
There are many other scientific and statistical text mining
methods developed but not covered in this talk.
http://www.cs.utexas.edu/users/pebronia/text-mining/
http://filebox.vt.edu/users/wfan/text_mining.html
Also, it is important to study theoretical foundations of data
mining.
Data Mining: Concepts and Techniques / J. Han &
M. Kamber
Machine Learning / T. Mitchell